Preface


The process of education involves three steps: (1) determining 
objectives, (2) providing experiences designed to achieve the objec- 
tives, and (3) measuring and evaluating the results to determine 
if the objectives have been achieved. 


Although measurement and evaluation is an important part of 
education, most teacher-training institutions in the United States 
do not require prospective teachers to take a course in the subject. 
Many of these institutions do provide instruction in the subject as 
part of some larger course which includes a unit on measurement 
and evaluation together with units on other aspects of education 
(e.g., principles, methods, curriculum, educational psychology).
Instructors who teach such units are often reluctant to require their
students to purchase one of the standard texts on measurement
and evaluation, since it is difficult to justify the expense in view of
the relatively small amount of time spent in studying the subject. 


This book has been written to meet the needs of the instructors 
and the students of courses which include a unit on measurement. 
It contains concise chapters on all of the topics which are of most 
importance to classroom teachers. The criterion employed in decid- 
ing what to include was simply—Is this topic important for class- 
room teachers? If the answer was yes, the topic was included. 
Since classroom teachers make more use of measuring instruments 
which they devise themselves than they do of standardized tests or 
inventories, a majority of the space has been devoted to this aspect
of measurement and evaluation. 


The typical classroom teacher has relatively little need for sta- 
tistics. This phase of measurement therefore has been minimized. 
It has not been neglected, however, since a minimum of statistical 
concepts and techniques necessary for summarizing grades and 
for interpreting scores on standardized tests has been included as 
an integral part of other chapters. 


The book is not an outline on measurement; rather, it is a short 
self-contained text. The annotated bibliography which is included 
describes the major standard-sized texts in the area so that students 
who wish to pursue a topic more fully may easily locate further
text materials. 


The material in the book has been used, in a preliminary mimeo- 
graphed version, by several hundred students at Los Angeles State 
College. We are indebted to our colleagues, Professors Prudence 
Bostwick, Marian Wagstaff, Morris Better, Bob Forbes, Robert 
Hahn, Sam Jones, George Kibby, Ray Pitts, and Julian Roth, and 
to their students for their many helpful suggestions. 


We are deeply grateful to Dr. Lucien B. Kinney of Stanford Uni- 
versity for his careful reading of the manuscript in its preliminary 
form, and for his many helpful suggestions. We are also indebted 
to Dr. John A. Dahl of Los Angeles State College for reading the 
manuscript in its final form. 


E. W.
G. W. B.


San Gabriel, California 
April 29, 1957 
CHAPTER 1


Introduction to Educational 
Measurement and Evaluation 


Meaning of measurement and evaluation 


The word measurement means “the act or process of ascertaining 
the extent or quantity of something.” Evaluation refers to “the 
act or process of determining the value of something.” Evaluation 
depends upon, but is not synonymous with, measurement. Evaluation
goes beyond measurement in answering the question: Is the
obtained measure desirable or undesirable? 

Courses in the subject covered by this book have, in previous 
years, been referred to as Tests and Measurements, or simply Measurement.
The emphasis was on tests and the statistical manipulation
of the test results. In recent years the scope of such courses
has been broadened to include many non-test techniques, such as 
observation, sociograms, and anecdotal records, which are used to 
supply a more complete picture of the pupil—his status and his 
progress. The word evaluation has become associated with this 
broadened scope and implies the use of non-test techniques as well
as tests. 

When a tire gauge registers twenty-four pounds of air pressure 
in a tire, this constitutes a measurement and of itself indicates a 
situation neither desirable nor undesirable. If the recommended 
pressure is twenty-four pounds, “everything is as it should be.” On
the other hand, if normal inflation is thirty pounds, then “something
is wrong.” Deciding that “something is wrong” is an evaluation
based on the evidence obtained from the measurement. The
evaluation continues as possible causes for the undesirable situa- 
tion suggest themselves: (1) The gauge may be wrong; (2) there 
may be a leak in the tire; (3) someone may have let air out of 
the tire; and (4) the air pressure may be low for some other reason. 
When the reason for the low reading is discovered, appropriate 
action is undertaken. 

In education much the same kind of process as that described 
above occurs. A pupil has a reading grade placement of 2.8—that 
is, he reads as well as the average pupil in the eighth month of 
the second year of school. This fact represents evidence obtained 
through a measurement and is neither desirable nor undesirable in 
itself. If the pupil is in the fifth grade and of normal intelligence,
the teacher knows that “something is wrong.” The teacher then 
seeks possible reasons for the discrepancy between the pupil’s actual 
reading level and the level indicated by his grade placement and 
intelligence. Appraising the evidence obtained from the measure- 
ment of the pupil’s reading ability is part of the evaluation process. 
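
The distinction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, grade placements are assumed to be written in tenths of a ten-month school year, and the pupil is assumed to be at the start of the fifth grade (placement 5.0), which the text does not specify. The computation supplies the measurement; judging whether the gap is acceptable remains the evaluation.

```python
MONTHS_PER_SCHOOL_YEAR = 10  # assumption: placements are written in tenths of a school year


def to_months(placement: str) -> int:
    """Convert a grade placement such as '2.8' into school months."""
    year, month = placement.split(".")
    return int(year) * MONTHS_PER_SCHOOL_YEAR + int(month)


def placement_gap(actual: str, measured: str) -> tuple[int, int]:
    """Return (years, months) by which the measured placement trails the actual one.

    Assumes the measured placement is not ahead of the actual placement.
    """
    gap = to_months(actual) - to_months(measured)
    return divmod(gap, MONTHS_PER_SCHOOL_YEAR)


# Hypothetical case: fifth-grade pupil (assumed 5.0) with a reading placement of 2.8.
years, months = placement_gap(actual="5.0", measured="2.8")
print(f"Reading level trails grade placement by {years} years, {months} months")
```

Placements are handled as strings rather than decimals so that "2.8" is read as year 2, month 8, without floating-point rounding.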


Uses of measurement and evaluation in guidance 


Our system of education recognizes the fact that all pupils are
different and will play different roles in society. Therefore,
although our society determines the general objectives of education,
specific objectives are influenced by the capabilities of the individual
pupil. Determining what objectives are reasonable for pupils is
the responsibility of educators, parents, and the pupils themselves.
This aspect of education is referred to as guidance. A sound choice
of objectives depends upon sound information about the pupils’
abilities, interests, attitudes, and character. This information is
obtained through use of the techniques of measurement and
evaluation.


Guidance is concerned with the answers to such questions as:

Should Johnny repeat the fourth grade?

Should Mary take a college preparatory major or a business
major?

Should Bill elect woodshop or orchestra?

Should Ann be transferred from Miss Smith’s room to Mrs.
Jones’ room?


Uses of measurement and evaluation in instruction 


Within the individual classroom each teacher utilizes measure- 
ment and evaluation for one or more purposes. 

1. To reveal the stage at which pupils have arrived in the 
learning process. 

All teachers find it is necessary to pause now and then to survey 
the job confronting them. This survey or evaluation may occur 
during the initial stages of a semester, or unit of work, to deter- 
mine at what level the instruction should start and the necessity 
for review of previous learnings. Evaluation at this stage allows 
for an investigation of the spread of ability in the class, thus 
identifying gifted students capable of an enriched program as 
well as the pupils who need remedial work. 

Evaluation part-way through the unit of work permits the 
teacher to appraise the extent to which pupils are progressing 
toward the goals of instruction. Appraisal here allows one to
determine whether the instructional pace can be accelerated or
whether some reteaching is necessary.

Evaluation at the conclusion of a unit of instruction occurs 
almost automatically in most classrooms. Here again the purpose 
is to determine the stage at which the pupils have arrived. Results
of the evaluation indicate whether a satisfactory level of achievement
has been reached and identify areas necessitating reteaching
and review. Information relative to readiness for the
next topic can also be obtained from the appraisal at the end
of a unit.

In the process of discovering the pupil's current status, it is 
necessary to reveal sufficient information to the class so that students
may engage in appropriate self-appraisal. In this way the
evaluation process may serve to motivate pupils to do better work. 
Motivation alone does not constitute sufficient cause for evaluation, 
and those who use tests primarily for this purpose are probably 
utilizing inadequate teaching methods as well as inadequately 
utilizing the test data. 

An ideal teaching-learning situation often develops after an
evaluation technique has been used. Since the pupil has some ego-involvement
in his response to a test item, a discussion of these
responses finds pupils eager to defend their stand, with effective
class interaction resulting. The alert teacher uses these discussions 
to dispel pupil misconceptions, identify misunderstandings, and also 
to identify poor test items. 

2. To determine the effectiveness of instruction and planned 
activities. 

Success in the classroom depends to a large extent upon the ade- 
quacy of the teacher’s plan for class activities. Unless there exists 
within the plan provision for evaluation, the effectiveness of these 
planned activities remains a mystery. Therefore, teachers are found 
reusing techniques and methods which, unknown to them, are 
extremely ineffective. At the same time, particularly strong tech- 
niques may be discarded for lack of validation. 

Professional people are personally responsible for their own pro- 
fessional growth—that is, improving and validating their methods. 
Evaluation serves a real purpose in providing them with techniques 
for doing so. As an example, a junior high school teacher recently 
became interested in group dynamics as a teaching technique. He 
decided to use this approach in a unit. He identified specific objec- 
tives, organized materials, and planned his procedures. At the close 
of the unit, through the employment of recognized evaluation tech- 
niques, he discovered the group dynamics approach particularly suc- 
cessful for his purposes. The results encouraged him to use the 
technique further, but to adjust it slightly to better fit the local 
school population. In the process he also identified areas of instruc- 
tion which needed further study, as well as individuals in the class 
who would benefit from remedial work. 

3. To serve as a basis for summarizing and reporting pupil
progress.

Almost all schools require teachers periodically to summarize and
report pupils' progress. These summaries are recorded in permanent
records of the school and are reported to parents in the
form of report cards, written reports, or other ways. Often the summaries
are stated in grades or marks. Parents utilize these summary
reports as an aid in understanding and guiding their children.
School personnel utilize the summary reports as an aid to guidance
and as a source of information when questions regarding pupils
arise in connection with enrollment in advanced or remedial classes,
job placement, and entrance to college. Since crucial decisions are
based on summary reports, the reports should be fair
and accurate. Through the use of measurement and evaluation
techniques, teachers can obtain sufficient information to achieve
this end.

4. To throw light on the feasibility and practicability of stated 
objectives. 

An appraisal of evaluative data sometimes results in review of 
aims rather than of procedures. Consider the zealous elementary 
arithmetic teacher who attempted to teach the multiplication proc- 
ess to the extent that all students achieved 100-percent accuracy. 
Through careful research, she acquired information on the best 
teaching techniques and the most effective teaching materials. After 
adapting both the techniques and materials to fit her own situation, 
she proceeded to teach multiplication. After a reasonable period of 
time, a testing program disclosed that the pupils had not achieved 
100-percent accuracy in multiplication. She then provided remedial 
sessions, modified her planned experiences, and devised new activ- 
ities. At the conclusion of these experiences, although much prog- 
ress had been made, she was still short of the goal. In evaluating 
the results of her efforts, she might well conclude that since she had 
confidence in her own ability and since the techniques and materials 
were carefully selected, possibly the chosen goal was unrealistic and 
impractical in view of the time and effort required to achieve it. 


Relationship between objectives, activities, and evaluation 


The relationship between evaluation and the entire instructional 
process is revealed by examining the steps in the instructional proc- 
ess. These are: 


1. Establishing objectives to be achieved. 


2. Providing experiences and activities expected to contribute to 
the achievement of the objectives. 


3. Evaluating to be sure that the desired results have been
achieved. 


These three steps can be utilized to systematize and organize a 
unit in a course or even an entire course. A plan sheet set up with 
three columns, each headed by one of the three instructional steps,
can provide the framework for course development that will guarantee
consideration of the key processes. An example of a partial
framework for a seventh-grade arithmetic course appears in Table 1.


Table 1. Partial Framework for a Seventh-grade Arithmetic Course


Objectives

The pupil:

1. Exhibits an appreciation for mathematics.
   1.1 Working on mathematical recreations during his leisure time.
   1.2 Asking questions relative to uses of mathematics in our society.

2. Exhibits insight into and understanding of the mathematical processes.
   2.1 Can explain the rationale behind the processes.
   2.2 Can explain the relationship between the processes (for example,
       division is a short cut for subtraction).

3. Exhibits facility in the processes of arithmetic.
   3.1 Can compute with reasonable skill and accuracy using whole
       numbers, fractions, and decimals.
   3.2 Can work problems involving per cents.

4. Can apply arithmetic skills to life situations.
   4.1 Solves problems that arise in his own experiences.
   4.2 Can read and construct graphs and tables with understanding.


Activities

The teacher:

1. Will introduce puzzles and recreational materials to the class and
   encourage them to work on them.
   Will organize field trips to industries which use mathematics
   extensively.

2. Will develop mathematical ideas through the use of concrete
   experiences.
   Will build new ideas upon concepts already understood, and will
   provide many opportunities for pupils to explain the “why” of the
   processes.

3. Will provide many opportunities for pupils to work exercises, play
   mathematical games, and drill on weaknesses in computation.

4. Will assign problems which are of interest to children at this age.
   Will provide problems which grow out of activities in other classes
   such as social studies, shop, and physical education.


Evaluation

1. Observation using check list and anecdotal records.

2. Observation: the teacher will listen as pupils "think aloud" in
   working problems.

3. Written tests: standardized, teacher-made, diagnostic.

4. Written tests with word problems. Observation. Interviews.

Note that the second and third columns above can be determined
logically when the objectives in the first column are stated behavior- 
ally, and further that the specific evaluation techniques relate 
directly to the particular pupil behavior desired. 


The key steps in the evaluation process 


In the process of evaluation, the teacher or evaluator raises four 
questions which determine the steps of the process. 


1. What would he see if the objectives were realized? This step
involves stating specific objectives and emphasizes that when
objectives are realized, some observable evidence must be available.

2. Where and in what situations would he see the evidence? The
place and time that the evaluator would note this observable evidence
must be identified.

3. How can he get a record of the evidence? The process for collecting
and organizing the evidence must also be identified.

4. How would he appraise the evidence, or what is its significance?
This final step asks what the evidence means and implies
that some action is taken in light of it.

The manner in which these four questions are answered is illus- 
trated by the following examples. 

A young man wishes to buy a used car and locates one which out- 
wardly seems to fit his needs. Before buying he investigates many 
aspects of the car, or to put it another way, he evaluates it. 

First, he asks, what does he want in a car, or what would he see 
if this were the car he wanted? The answers to these questions 
make up his objectives. He might want such things as: 

1. Efficient performance. 

2. Attractive appearance. 

3. Accessories. 

4. Appropriate sales price. 

Secondly, he asks, where would he see evidence related to these 
objectives? The careful buyer would not accept the advertisement 
or sales talk at face value. He would turn directly to the automo- 
bile to collect most of this evidence. Some of the characteristics, 
such as the paint, he could observe while the car was in the car lot. 
In other cases he would plan a situation to obtain evidence. For 
example, to determine performance, he would drive the car and in
the process might choose to drive up a steep hill or over a rough 
street, or perhaps he would drive it for some distance in second 
gear. Note that to collect some evidence he contrives a situation 
which is not the usual. 

For the third step he needs to record the evidence he has decided 
to collect and asks himself how can he economically and efficiently 
collect these data. Some of his questions could be answered by 
direct observation, and he could record these responses on a check
list or by making a note of his findings. Evidence relating to gas 
mileage or oil consumption can be obtained by measuring the gaso- 
line and oil before and after a trip. We might note further that if 
he is skilled and has the correct equipment, he might perform one 
or two tests and obtain some of the information much more effi- 
ciently. For example, a compression check might indicate in a few 
minutes more information about the car’s gas and oil consumption 
than the buyer could discover in a 100-mile trip. This is analogous 
to many educational tests which provide a short-cut method of 
collecting data on pupils. 

To be useful, the collected information must not only be recorded 
but organized. The car buyer looks at several cars and wishes to 
choose among them; he will need records that permit a valid com- 
parison. A check list might be one method for the car buyer to 
record and organize his data. 

Finally, when he has collected all the information he can in the 
time allowed, he must appraise the data and make a decision. He 
either buys the car, bargains for an adjustment of price, or rejects 
it entirely. Note here that he attaches a value to the evidence he 
obtained and takes appropriate action. 

To relate these steps specifically to educational evaluation con- 
sider the following illustration. 

A teacher assumes the goal for her class, “the development of 
good work habits.” This very general objective might be included
in the list of aims for almost any class or course. She then asks the
question: What would she see if her pupils displayed good work
habits? After some consideration, she lists the following:

1. Promptness in reporting to class.

2. Bringing books and other school supplies to class.

3. Completing assignments on time.

4. Organizing the job to be done.

5. Efficient budgeting of time.
What she actually did was to express the objective, “the development
of good work habits,” in terms of observable pupil behavior.
Expressing objectives in this way is often referred to as the opera- 
tional definition of aims. 

As a second step, situations are planned where this pupil behavior 
may be observed. Such things as reporting to class promptly and
bringing books can be observed during the normal course of school 
work. To determine whether pupils complete assignments on time 
the teacher may assign a variety of jobs to be completed, including 
committee reports, individual projects, and the like. Actually, all 
teacher assignments serve in part as contrived situations from which 
evidence can be obtained for evaluation purposes. 

As a natural next step, the teacher will determine how she can 
obtain a record of the evidence. 

Promptness in reporting to class can be observed and recorded. 
Also, a periodic check will reveal whether pupils have brought their 
books and other school supplies. Some record should be made of the 
occasions when a pupil has not provided these items. Failure to 
hand in school assignments on time should also be systematically 
recorded each time it occurs. 

Finally, when the teacher has recorded the evidence on work 
habits, she is ready to appraise the evidence and act accordingly. 
The action taken in this case might take several forms. Probably 
the data for each student will need to be summarized, a mark 
assigned, and a report made on a form or card. Perhaps a reorgan- 
ization of the instructional approach to supply additional training 
in the development of work habits is necessary. It may be that 
insufficient data were collected, implying a need to revise the evalua- 
tion techniques. In any event, the point is that some appropriate 
action must be taken in light of the evidence obtained; otherwise 
the work and planning involved in collecting the evidence have no
purpose. 


Essential characteristics of measurement procedures 
used in evaluation 


Although measurement is never an end in itself, sound measure- 
ment is a prerequisite to sound evaluation. Correct decisions cannot 
be made based on faulty evidence. Regardless of which particular
measurement technique is employed, there are three questions
regarding the technique which should be answered affirmatively: 

1. Does the technique obtain valid evidence? 

2. Does the technique obtain reliable evidence? 

3. Is the technique practical and economical? 

The degree to which a measurement technique obtains the kind 
of evidence which its user intends it to collect is the measure of its 
validity. A test composed of computation items in sixth-grade arith- 
metic is a valid test of achievement in sixth-grade arithmetic com- 
putation. However, it is not a valid test of the pupils’ ability to 
apply the computation to the solution of word-problems. A valid 
test of this ability would include items requiring the solution of 
word-problems.¹

The degree to which a measurement technique obtains accurate 
and consistent evidence is the measure of its reliability. For exam- 
ple, if a desk is measured with a yardstick by two different persons, 
it is reasonable to expect both persons to arrive at very similar 
answers. However, if a child's intelligence is determined by two 
different psychologists on succeeding days, there may be a 10- or 
15-point difference between the I.Q.’s determined by the two psy- 
chologists. Since the intelligence of the child does not change over 
such a short period, the difference in the I.Q.’s must be attributed to
lack of precision in the measuring instrument. The terms precision 
and consistency are actually two words which express the same idea. 
Thus, if a yardstick is used to measure height to the nearest half 
inch, then it is reasonable to expect that the measures would be 
precise and the heights of the students obtained by one teacher 
would be consistent with the heights obtained by a second teacher.²
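
One rough way to look at consistency is to compare two sets of measurements of the same pupils and note the largest disagreement. The sketch below is illustrative only: the function name is hypothetical and all of the figures are invented for the example, chosen to echo the half-inch yardstick and the 10- to 15-point I.Q. differences mentioned above. It is not a formal reliability coefficient.

```python
def max_disagreement(first: list[float], second: list[float]) -> float:
    """Largest absolute difference between paired measurements of the same pupils."""
    if len(first) != len(second):
        raise ValueError("both lists must cover the same pupils")
    return max(abs(a - b) for a, b in zip(first, second))


# Invented data: heights in inches, measured to the nearest half inch by two teachers.
heights_teacher_a = [54.0, 57.5, 60.0, 52.5]
heights_teacher_b = [54.5, 57.5, 59.5, 52.5]

# Invented data: I.Q.s of the same four children from two psychologists.
iq_psychologist_a = [98, 110, 121, 87]
iq_psychologist_b = [110, 97, 108, 101]

print(max_disagreement(heights_teacher_a, heights_teacher_b))  # 0.5
print(max_disagreement(iq_psychologist_a, iq_psychologist_b))  # 14
```

A small maximum disagreement relative to the unit of measurement suggests consistent evidence; a large one suggests that the instrument, not the pupil, accounts for much of the difference.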

The third essential characteristic of a measuring instrument 
relates directly to its usefulness. Obviously an evaluation technique 
must be sufficiently economical, costwise, so that the school can 
afford it, and also timewise, so that the teacher, with all her respon- 
sibilities, can carry it through. It is for these reasons that paper- 
and-pencil tests have achieved their popularity. Tests can be admin- 
istered to groups, conveniently scored, and interpreted according to 
a standard. However, as stated previously, many objectives cannot 


be evaluated by means of a written test, and thus many other tech- 
niques for evaluation have been developed. 


¹ For a further explanation of validity see Appendix B, page 109.
² For a further explanation of reliability see Appendix B, page 110.
Evaluation is comprehensive and continuous


Evaluation is comprehensive because evidence is obtained regard- 
ing pupils’ abilities, interests, health, adjustment, achievement, 
character—in fact, every aspect of the total personality. This evi- 
dence is used to guide pupils and to judge pupils’ progress. It is 
also used to evaluate the quality of the educational program 
offered to the pupils. 

Evaluation is continuous because every action of the pupil is 
a part of the evidence which the teacher gathers in order to better 
understand the pupil. Evaluation is not limited to the weekly test 
or the final examination. Every question the pupil asks, every 
assignment the pupil completes, in short, everything which he 
does, in and out of the classroom, contributes to the total evidence
which the teacher gathers. To collect the many different
kinds of evidence requires the use of a variety of measurement 
techniques. The most common and useful of these techniques are 
discussed in the following chapters of this book. 


EXERCISES 


1. What are some measurable pupil behaviors which you consider 
to reflect “good citizenship” at the third-grade level? At the
twelfth-grade level? 


2. What are the dangers in attempting to evaluate "good citizen- 
ship" without translating it into pupil behavior? 


3. State some general objective in your teaching area and translate 
the general objective into specific measurable pupil behavior. 


4. What reasons might account for the fact that a child in the fifth 
grade tests at the second-grade level in reading? How would you 
determine which of these possible reasons was actually correct? 


5. Select three or four objectives from a course of your choice and
fill out the following plan sheet. In the objective column, state 
the objectives in terms of specific measurable pupil behaviors. 
In the activities column, list the activities which you would pro- 
vide to accomplish the objectives. In the evaluation column, list 
the various methods by which you could determine whether 
the objectives had been achieved. 
For this and other chapters see the annotated bibliography, pages
CHAPTER 2


Evaluating Achievement with 
Teacher-devised Short-answer Tests 


The short-answer type of test consists of a collection of items
which can be answered by selecting [...] number of possible answers
supplied [...] supplying the correct [...] symbols. Items of this type
are sometimes called objective items [...] the pupil's answers may
be [...]

[...] only one of many measurement tools utilized
by the teacher in the evaluation process. In the course of a
semester's work, the teacher may utilize short-answer tests, essay
tests, discussions, observations, term papers, oral reports, and other
means of evaluating her pupils' progress as well as her own teaching
effectiveness. Before constructing a short-answer test, therefore, it is
necessary to decide whether this type of test is the proper measuring
device to use. Sometimes, after consideration, it will be apparent
that a short-answer test is not appropriate for evaluating the particular
objectives in question.


Planning the short-answer test 


After deciding that a short-answer test is appropriate for measuring
the objectives under consideration, the next step is to plan
the test. At this stage the objectives to be measured by the test must
be considered. Although tests are sometimes written to evaluate a 
single objective of instruction, they usually cover several different 
ones. When this occurs, it is necessary to decide what portion of the 
test shall be assigned to each of the separate objectives. Unless this 
is done, the portions of the test devoted to the various objectives 
may be completely out of line with the relative emphasis which 
should be placed on each of the objectives. Consider a social 
studies unit on South America which has as one of its objectives the 
knowledge of the principal products of the various countries. 
Although a knowledge of these products might constitute only 10 
percent of the stated objectives of the unit, a test on the unit might 
include an excessive number of items measuring this objective be- 
cause of the relative ease with which such items can be constructed. 

Both the relative weight of the various objectives of the unit or 
material covered by the test and the relative weight of the various 
content areas covered by the unit must be considered in writing the 
test. If in the social studies unit cited 10 percent of the time and 
effort of the class has been devoted to studying Argentina, it would 
seem reasonable to expect approximately 10 percent of the items on 
the test to deal with Argentina. A test which either neglected 
Argentina entirely or included 30 percent of the items on Argentina 
would not be reflecting the time assigned to this aspect of the unit. 
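
A draft test can be checked against the planned emphasis with simple arithmetic. The sketch below is a hypothetical illustration: the function name, the 40-item draft, and the grouping of all remaining material as "Other countries" are assumptions made for the example; only the 10 percent and 30 percent figures for Argentina come from the discussion above.

```python
def item_share(items_by_area: dict[str, int]) -> dict[str, float]:
    """Percentage of the test's items that falls in each content area."""
    total = sum(items_by_area.values())
    return {area: 100 * count / total for area, count in items_by_area.items()}


# Hypothetical draft: 12 of 40 items deal with Argentina.
draft = {"Argentina": 12, "Other countries": 28}
# Planned emphasis, based on the share of class time given to each area.
planned = {"Argentina": 10.0, "Other countries": 90.0}

shares = item_share(draft)
for area, pct in shares.items():
    print(f"{area}: {pct:.0f}% of items, planned {planned[area]:.0f}%")
```

Here the draft gives Argentina 30 percent of the items against a planned 10 percent, the kind of imbalance the paragraph above warns against.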

Professional test-makers often develop a "blueprint" for the test 
which specifies the exact percentage of items according to content 
and objective. The teacher usually will not go into as much detail 
as the professional test constructor. There is, however, a clear neces- 
sity to plan the emphasis in the items according to the objectives to 
be measured and the content to be included. In no event should 
the test-maker just start writing items. Since items are easier to 
write in some areas than in others, a test constructed in this way can 
only by the very sheerest coincidence correspond to the test which 
would have been devised by a prior consideration of the objectives 
and content. It is not necessary to spend the time and effort on an 
elaborate plan or “blueprint” for each test. A simple test specification,
such as the one in Table 2, will result in a test far better than
one written with no specifications. 

The specifications for the fifty-minute arithmetic test can also be 
represented by the "blueprint" in Table 3, which combines the
information regarding objectives and content. 
Table 2. Specifications for a Fifty-minute Arithmetic Test

Allocation of items by objective
    Items measuring ability to compute                    40%
    Items measuring ability to use computation in
      problem solving (word problems)                     60%

Allocation of items by content
    Fractions                                             25%
    Decimal fractions                                     25%
    Percents                                              50%


It is obvious that more detailed specifications could be written. 
For example, the 25 percent allotted to fractions could be subdi- 
vided into addition of fractions, multiplication of fractions, and so 
forth. Such refinement in the specifications would be more desirable
and would result in an improved test. However, any specifications
are better than no specifications at all, and all test-makers
should make some attempt at “blueprinting” their tests before writing
test items.


Table 3 Blueprint for a Fifty-minute Arithmetic Test

                                        Objectives
Content                      Computation    Solving Word-problems
                                (40%)               (60%)

Fractions (25%)                  10%                 15%
Decimal fractions (25%)          10%                 15%
Percents (50%)                   20%                 30%
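The cells of such a blueprint follow from multiplying each content weight by each objective weight. The sketch below reproduces the percentages in Table 3 from the weights in Table 2 and then converts them to item counts. The 40-item test length and the largest-remainder rounding rule are assumptions made for illustration; they are not part of the specifications above.

```python
from fractions import Fraction

# Weights taken from Table 2 (percent of items).
OBJECTIVES = {"Computation": 40, "Solving word-problems": 60}
CONTENT = {"Fractions": 25, "Decimal fractions": 25, "Percents": 50}


def blueprint(objectives, content):
    """Return {(content, objective): percent of all items} for every cell."""
    return {
        (c, o): Fraction(c_pct * o_pct, 100)
        for c, c_pct in content.items()
        for o, o_pct in objectives.items()
    }


def item_counts(cells, total_items):
    """Convert cell percentages to whole-item counts summing to total_items.

    Largest-remainder rounding is an assumed convention, used only so that
    the counts always add up to the intended test length.
    """
    exact = {cell: pct * total_items / 100 for cell, pct in cells.items()}
    counts = {cell: int(x) for cell, x in exact.items()}
    leftover = total_items - sum(counts.values())
    by_remainder = sorted(
        exact, key=lambda cell: exact[cell] - counts[cell], reverse=True
    )
    for cell in by_remainder[:leftover]:
        counts[cell] += 1
    return counts


cells = blueprint(OBJECTIVES, CONTENT)
counts = item_counts(cells, total_items=40)  # hypothetical test length
for (c, o), pct in cells.items():
    print(f"{c:18} {o:22} {float(pct):5.1f}%  {counts[(c, o)]} items")
```

With these assumptions the six cells receive 4, 6, 4, 6, 8, and 12 items, in the same order as the rows and columns of Table 3.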


In addition to the objectives and content of the test, the test- 
maker must also think of the length and the desired difficulty. The 
length of the test will depend on the amount of material to be covered
and the extent to which other measures will be available. If a
test is given every few days, each test may be rather short. On the
other hand, if an entire semester's grade is to be based on only two
or three tests given during the semester, each test will, of necessity,
be long and comprehensive.
The difficulty of the items should be determined by the purpose
of the test. If the test is being given to determine whether the pupils
have mastered minimum essential learnings, it will not matter if
the items are easy and all of the pupils receive high scores. However,
if the test is being given to determine which pupils know most
about the subject, and which know the least, it will be necessary to
include some difficult items; otherwise, there will be no way to
distinguish between the good pupil and the poor one.

After the objectives, content, difficulty, and length of the proposed
test have been determined, the items are written. At this
point the test-maker has the choice of a wide variety of item forms. 
Only by being familiar with the various types of items and knowing 
their advantages and limitations can the item-writer decide which 
type or types of items to employ in any given test. 


Selection-type items and supply-type items: Two basic types 


Although there are many different types and variations of items, 
they can be divided into two major types—selection-type items and 
supply-type items. Selection-type items require the pupil to select 
a response from several alternatives supplied by the test-maker. 
Supply-type items require the pupil to provide a word, phrase, or 
number for the answer. Multiple-choice, matching, and true-false 
illustrate the selection-type items. Direct questions and completion
items constitute the supply-type item. The essay item is actually
a form of supply-type item although it is usually considered
a different type and will be discussed separately in Chapter 2.


Multiple-choice items


The multiple-choice item consists of either a question or an
incomplete statement followed by two or more possible answers
to the question or completions of the statement. These possible
answers are referred to as responses. The question form of the
multiple-choice item is illustrated by the following example:

Who was President of the United States of America in 1955?

1. John Dulles.

2. Dwight D. Eisenhower.

3. Richard Nixon.

4. Charles Wilson.
This same item in the incomplete statement form would read:

The President of the United States of America in 1955 was 

1. John Dulles. 

2. Dwight D. Eisenhower. 

3. Richard Nixon. 

4. Charles Wilson. 
For this particular item, there is no advantage to either the ques- 
tion form or the incomplete statement form of item. 

A fundamental requirement of a multiple-choice item is that 
the stem, the name given the question or incomplete statement, 
pose a distinct problem. Since the question form of item forces 
the writer to formulate a complete thought, this form of the
multiple-choice item is more appropriate for use by inexperienced
item writers. Using the incomplete statement form of item can 
result in an item which is nothing more than a series of true-false 
statements as the following item illustrates. 

The President of the United States of America in 1955 was 

1. a Republican.

2. formerly a Supreme Court Justice. 
3. formerly an officer in the Navy. 

4. a bachelor. 


This kind of item should be avoided. Unless there is clearly 
one central problem, the multiple-choice form of question is not 
appropriate. 

The examples of multiple-choice items given thus far have had 
four choices, have had one correct answer, and have called for a 
knowledge of factual material to determine the correct response. 
None of these conditions is necessary to a multiple-choice item. 
A multiple-choice item may have as few as two choices or as many 
as the item-writer can devise. The reason for usually having four 
or five choices is that it reduces the chances of a pupil's guessing the 
right answer. However, each incorrect response, called a distractor, 
should be plausible to a person who does not know the correct 
answer. When multiple-choice items are written for use with young 
children, there should only be two choices, the correct answer and 
one incorrect answer. 
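The effect of the number of choices on guessing can be put in figures. The sketch below assumes a pupil who guesses blindly, with every response equally likely to be chosen; real pupils who can rule out implausible distractors will do better than this. The function names and the 20-item test length are illustrative only.

```python
from fractions import Fraction


def chance_of_guessing(num_choices):
    """Probability of a correct blind guess on an item with one right answer."""
    return Fraction(1, num_choices)


def expected_chance_score(num_items, num_choices):
    """Average number of items answered correctly by blind guessing alone."""
    return num_items * chance_of_guessing(num_choices)


# Compare two-, three-, four-, and five-choice items on a 20-item test.
for k in (2, 3, 4, 5):
    print(k, chance_of_guessing(k), float(expected_chance_score(20, k)))
```

Under these assumptions a 20-item test of two-choice items yields about 10 correct answers by chance, while the same test with four choices per item yields about 5.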

The most common form of the multiple-choice item is that 
which calls for one right or best answer. A variation sometimes 
used is to vary the number of correct answers from item to item, 
and require the student to mark all correct answers. When this 
variation is used, it is possible for no answer to be right or for one 
or more of the answers to be right. The following item illustrates 
this variation of the multiple-choice item: 

Which of the following geometric figures contains four right 
angles?

1. Square. 

2. Circle. 

3. Rectangle. 

4. Equilateral triangle. 
Items of this type can be marked by giving one point for each cor- 
rect answer and one point for each incorrect answer which is not 
marked. In the preceding example, the student who marked 
responses 1 and 3 would receive the maximum score of four points. 
The student who marked responses 2 and 4 would receive no points. 
This type of item can also be scored on an all-or-none basis; that is, 
if responses 1 and 3 were marked the answer would be right; any 
other combination of marks would result in no credit. 
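The two scoring rules just described can be sketched as follows. The function names are hypothetical; responses are identified by their numbers, and the example uses the geometric-figures item above, in which responses 1 and 3 are correct.

```python
def score_per_response(marked, correct, num_responses):
    """One point for each correct response that is marked and for each
    incorrect response that is left unmarked."""
    marked, correct = set(marked), set(correct)
    points = 0
    for response in range(1, num_responses + 1):
        # A response earns a point when the pupil's decision (mark or
        # leave blank) agrees with the key.
        if (response in marked) == (response in correct):
            points += 1
    return points


def score_all_or_none(marked, correct):
    """One point only when exactly the correct responses are marked."""
    return 1 if set(marked) == set(correct) else 0


KEY = {1, 3}  # square and rectangle
print(score_per_response({1, 3}, KEY, 4))  # marks both correct responses
print(score_per_response({2, 4}, KEY, 4))  # marks both incorrect responses
print(score_per_response({1}, KEY, 4))     # marks only the square
print(score_all_or_none({1}, KEY))
```

A pupil who marks only response 1 earns three of the four points under the first rule but no credit under the all-or-none rule, which shows how much the choice of rule can change a score.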

The multiple-choice item is the most versatile form of short- 
answer item. It can be used to measure skill, knowledge, under- 
standing, and application. Short-answer tests, including multiple- 
choice tests, have been criticized as measuring only factual out- 
comes. If the item-writer has a clear picture of what understanding 
or application is to be tested, and is willing to take the time and 
effort to develop good items, this objection can very easily be 
overcome. 

One method of measuring understanding is to provide the pupil 
with written or pictorial materials which pose a new and realistic 
problem situation, and then present him with objective (usually 
multiple-choice) items which test his ability to apply school-learned 
skills in solving these new problems. Although this type of item 
takes time to write, it can be used effectively to measure understanding.
It has been used extensively in the Sequential Tests of
Educational Progress. On the following pages are illustrations of
items¹ from this series of tests in the areas of reading, writing, mathematics,
science, and social studies. For further illustrations of this
type of item see The Measurement of Understanding, edited* by
N. B. Henry.


¹ Quoted from A Prospectus for the Sequential Tests of Educational Progress
with the permission of the Cooperative Test Division, Educational Testing
Service, Princeton, N. J., and Los Angeles, Calif.


20 Achievement with teacher-devised short-answer tests


Samples of Reading Comprehension Test Material 
(Grades 4-6) 
Dear Bill, 

It was fun to be on the farm. Yesterday morning, Jack and I 
watched Aunt Mary make butter. She did not need to use all her 
cream to make butter. She sent most of the cream to the creamery. 

I wish I were a farmer. I would take just a little cream for butter.
Then I would use all the rest of the cream to make ice cream.
Wouldn't that be fun?


I'm sorry you could not go to Jack's farm with me. I had the 
time of my life. Every day, Jack kept finding some new thing to do. 


I came back to town yesterday. I must say good-bye for now. 
Write soon, 


Your cousin, 
Betty 
41. In this letter, Betty is trying to tell 
A how to make butter. 
B what she did at the farm. 
C what horses eat. 
D how much noise a hog makes. 


42. In the first part, Betty tells about 
E how the creamery makes butter. 
F Betty and Jack making butter. 
G where cream comes from. 
H Aunt Mary making butter. 

43. Which of these things that Betty said tells best how she feels 
about living on a farm? 
A We worked around the barn. 
B I came back to town yesterday. 
C I wish I were a farmer. 
D We rode Jack’s horse. 


*Forty-fifth Yearbook, Part 1, National Society for the Study of Education.
Chicago: The University of Chicago Press, 1946.


44. The letter is happy except where Betty is
E saying Bill couldn't come.
F telling about riding the horse. 
G having to say good-by. 
H telling about the cream. 


45. Where does Betty live? 
A In the mountains. 
B On a farm. 
C Near the ocean. 
D In a town. 


Samples of Writing Test Material 
(Grades 10-12)
My Favorite Magazine


1 Many young people of today are taking a great interest in the 
magazine Suburbia. 2 It is of a fairly large size with a considerable 
number of pages. 3 The publishers, Allen and Watts, are well 
known and reputable; thus providing young homemakers with ideas 
and practical plans for their present and future homes. 4 Because 
it also contains articles of family and community relations and hints 
on home improvement, it is most likely preferable reading to people 
who are seeking guidance on such matters. 5 The previously men- 
tioned content explains why the advertisements would logically be 
about products and furnishings for the home. 6 The articles were 
well written, and all the features of Suburbia help to form an inter- 
esting and informative magazine. 7 I found one particularly inter- 
esting article, it was entitled “Families Are Using Spare Time to 
Broaden Their Horizons.” 8 This article points out that leisure
time should be spent in any activity other than usual work, instead 
of remaining in complete inactivity. 


8. In Sentence 2, how could the size of the magazine be indi- 
cated most effectively? 
E By comparing it with one or two well-known magazines. 
F By giving length, width, thickness, weight, and number of 
pages. 
G By drawing a scale model. 
H By telling how many articles each issue contained. 
9. As Sentence 3 is now written, which way of punctuating it is 
most acceptable? 


A reputable; thus (As it is now) 
B reputable: thus 

C reputable. Thus 

D reputable, thus 


10. Which of these revisions of Sentence 3 is best? 


E Since the publishers, Allen and Watts, are well known and 
reputable, they provide young homemakers with ideas and 
practical plans for their present and future homes. 

F The publishers, Allen and Watts, provide young home- 
makers with ideas and practical plans for their present and 
future homes; hence they are well known and reputable. 

G Allen and Watts are well known, providing young homemakers
with ideas and practical plans for their present and
future homes, as reputable publishers.


H The publishers, Allen and Watts, who are well known and 
reputable, provide young homemakers with ideas and prac- 
tical plans for their present and future homes. 


11. Sentence 5 is awkward. Which of the following revisions is
best?

A Advertisements are about products and furnishings for the 
home because this is logical in a magazine of such content. 

B The advertisements would logically be about products and 
furnishings for the home, like the articles. 

C In keeping with the content of the articles, the advertise- 
ments are about products and furnishings for the home. 

D Because the content of the articles, as previously explained,
are about homemaking, so the advertisements properly are
also.


12. Since this report is just one paragraph, with which sentence
should it stop?


E Sentence 5.
F Sentence 6.
G Sentence 7. 
H Sentence 8. 
Samples of Mathematics Test Material 
(Grades 4-6) 


Situation: In Tom’s school, some children ride bicycles, some walk 
to school, and some ride the school bus. The pupils on the safety 
patrol have to come early. 


1. Two children from each class in the school were members of 
the safety patrol. To find how many patrol members there are 
altogether, what other factor would it be necessary for you 
to know? 


A The number of children in the school. 
B The number of classes in the school. 
C The number of children in each class. 


D The number of street crossings. 


(Grades 7-9) 


Situation: Mrs. Cain went to the power and light company to check
on her electric bills and obtain information about electrical equipment.


1. Mrs. Cain wanted to buy an electric blanket. The office manager
of the electric company told her that the blanket would
cost 3 cents a night to use. If she used it 200 nights out of the
365 nights of the year, the yearly cost would be
A $ 4.95.
B $ 6.00.
C $10.95.
D $60.00.


2. Mrs. Cain’s house has 4 electrical circuits. The power company
recommends having one circuit for each 500 square feet
of floor space. On this basis, how many additional circuits
should Mrs. Cain have installed?
E 1
F 2
G 4
H cannot be determined from the information given.
(Grades 10-12) 
Situation: Mr. Jones has a dairy farm on which he also grows corn. 


1. Mr. Jones has two fields of equal size on which he grows corn. 
If 7/8 of field I and 8/9 of field II are devoted to corn, which 
one of the following statements is true? 


A Field I has more space devoted to corn. 

B Field II has more space devoted to corn. 

C Equal space is devoted to corn in both fields. 

D The amounts of space devoted to corn cannot be compared. 


2. Mr. Jones said that 3/4 of his cows were Jerseys, but only 2/3
of his neighbor’s cows are Jerseys. If the neighbor’s herd is
larger than Mr. Jones’, which one of the following statements
is true?

A They have the same number of Jerseys.
B Jones has more Jerseys.
C The neighbor has more.
D It cannot be determined who has more Jerseys.


(Grades 13-14) 


Situation: The Mill City Statistical Agency conducts opinion polls
and surveys and performs related statistical research.


1. A new interviewer for the agency reported at the end of his
first day that he had interviewed 100 people. He said that 42
of these were men, of whom 30 were Democrats, and that 49
were Republican women. The agency needed to know how
many Democratic women he had interviewed but all he could
remember was that everyone had been either Democratic or
Republican. From this information alone, it is possible to
determine that
A the data are contradictory.
B there were 9 Democratic women.
C there were 19 Democratic women.
D there is still insufficient data for an answer.


2. In the group interviewed in the preceding question, 30% of
the persons were Democratic men. Another interviewer
reported 20% Democratic men in a second sample. If their
reports are combined, then the percent of Democratic men
A is 50%.
B is 25%.
C cannot be computed without knowledge of the size of the
second sample.
D is unknown because the figures are contradictory.


Samples of Science Test Material 
(Grades 4-6) 


Situation: Tom wanted to learn which of three types of soil—clay, 
sand, or loam—would be best for growing lima beans. He found 
three flowerpots, put a different type of soil in each pot, and 
planted lima beans in each. He placed them side by side on the 
window sill and gave each pot the same amount of water. 


[Figure: three flowerpots labeled LOAM, CLAY, and SAND]


The lima beans grew best in the loam. Why did Mr. Jackson 
say Tom's experiment was NOT a good experiment and did NOT 
prove that loam was the best soil for plant growth? 


A The plants in one pot got more sunlight than the plants in the 
other pots. 

B The amount of soil in each pot was not the same. 

C One pot should have been placed in the dark. 

D Tom should have used three kinds of seeds. 


(Grades 7-9) 


Situation: Tom planned to become a farmer and his father encouraged
this interest by giving Tom a part of the garden to use for
studying plant life.

Tom wanted to find out what effect fertilizer has on garden
plants. He put some good soil in two different boxes. To box A he
added fertilizer containing a large amount of nitrogen. To box B
he added fertilizer containing a large amount of phosphorus. In
each box he planted 12 bean seeds. He watered each box with the
same amount of water. One thing missing from Tom's experiment
was a box of soil with

A both fertilizers added.
B neither nitrogen nor phosphorus fertilizers added.
C several kinds of seeds planted.
D no seeds planted.


(Grades 10-12) 


Situation: You and your family are visiting the Grand Canyon 
National Park in Arizona. The canyon, one of the geological won- 
ders of the world, is a gigantic gorge as much as 18 miles across 
and a mile deep. At the bottom of this gorge the Colorado River 
is now flowing through an inner gorge of extremely ancient meta- 
morphic rocks, which are covered by thousands of feet of varied 
sedimentary formations. 

Upon reaching the bottom of the canyon, you find the Colorado
River extremely turbulent and muddy. To determine how much
mud and other eroded material is in the water, it would be best
to take measured samples of the river water and

A determine the average molecular weight of the samples.
B evaporate the water and weigh the residue.
C add reagents to precipitate dissolved minerals, filter, and weigh
the residue.
D filter, evaporate the water, and weigh the residue from the
filtrate.


(Grades 13-14) 


Situation: The Alpha Uranium Company is organized to prospect 
for, obtain, and refine uranium ores. 

As a protection against radiation injuries, a check is needed to 
determine whether plant employees have been exposed to too much 
radiation. Which of the following safety procedures would be best? 
A Providing each worker with a Geiger counter to be carried at 
all times. 

B Providing each worker with a radiation-sensitive piece of pho- 
tographic film mounted in a badge. 

C Having each employee pass by a Geiger counter as he leaves 
the plant. 


D Taking a weekly X-ray of each employee and checking for 
radiation bone-damage. 


Samples of Social Studies Test Material 
(Grades 4-6) 


The students are presented with a simple map of an imaginary
island on which places are indicated by numbers.

[Map: an imaginary island with numbered places; scale of miles marked 0, 350, 700]

They are asked questions such as:

1. If explorers came to this island by ship, where would they find
the safest harbor?

A 2    B 4    C 9    D 10


2. Which of these places is on a peninsula?
E 4    F 6    G 7    H 9


(Grades 10-12)
The students are provided with monthly temperature and rainfall
charts for four places.

[Charts: monthly temperature and inches of rainfall for four cities, numbered I through IV]

They are asked such questions as:


1. Which of these cities are north of the equator?
A I and III only.
B II and IV only.
C All of the cities.
D None of the cities.


2. In which city would one need the greatest variety of weights
of clothing?
E I    F II    G III    H IV


Suggestions for writing multiple-choice items 


The quality of multiple-choice items can be improved by follow- 
ing these suggestions: 


1. Be certain that each item has a central problem. One way to
test this is to try to phrase the item as an essay item. Unless this
can be done, there is no central problem.


2. Be certain that there is only one correct answer, unless you are
using the variation which permits multiple correct answers.


3. Be certain that each option is grammatically correct and rele- 
vant to the stem. Test your items by reading the stem followed by 
each possible answer separately. This check will often turn up 
poorly worded responses. 


4. Avoid using phrases lifted directly out of the text. 


5. Avoid having the correct answer longer than the incorrect 
answers. 


6. Avoid writing “negative” questions, those which ask for the 
wrong answers rather than the right one. This type of item can be 
very confusing if included in a test where the student has a “set” 
to look for the correct answer. 


Matching items 


The matching item is a form of the multiple-choice item. It 
differs from the usual multiple-choice item in that a number of 
problems are presented simultaneously together with a number of 
answers, each of which is a possible answer to each of the problems. 
The problems and the answers are usually presented in two parallel 
lists with the problems in the left-hand list and the answers in the 
right-hand list. The student's job is to match each problem in the 
left-hand list with the correct answer in the right-hand list. The 
example below illustrates the form of the matching question, and 
illustrates also some defects often found in this type of item. 




Instructions: In the blank in front of each statement in the first 
column place the letter preceding the word or phrase in the 
second column that is most closely related to it. 


____ 1. Largest city in California.             A. 1492.
____ 2. A president of the United States.       B. George Washington.
____ 3. The year in which Columbus              C. Texas.
        discovered America.                     D. Los Angeles.
____ 4. Largest state in the
        United States.

This example is a poor one for two reasons. First, the statements 
in the left-hand column have nothing to relate them to each other. 
Because they are so heterogeneous, each item has only one logically 
possible answer in the second column. The second criticism of the 
item is that there are an equal number of problems and answers. 
If each answer can be used only one time, the person who knows
the answers to all of the problems except one will be able to
answer the last problem by elimination.


The following example is free of these two faults. 
____ 1. First president of the United States.   A. Eisenhower.
____ 2. Only United States president to be      B. Lincoln.
        elected for four terms.                 C. F. D. Roosevelt.
____ 3. President of the United States who      D. T. Roosevelt.
        was a five-star general in World        E. Truman.
        War II.                                 F. Washington.
____ 4. President of the United States when
        the slaves were freed.

The matching item is best used in testing factual knowledge
such as names, dates, places. The ease with which the items can
be constructed may lead the teacher to "overtest" on factual information.


Suggestions for writing matching items 


1. The problems in a matching item should be homogeneous; that is,
they should all be of the same general type (e.g., dates, names, places).


2. There should be more possible answers provided than there
are problems presented.


3. Each matching item should be relatively short. If a long list
of problems is to be presented in matching form, split them into
two or more items.
4. Arrange the possible answers in a logical order (alphabetical, 
chronological, etc.) if such an order exists. In the last example 
above note that the presidents are listed in alphabetical order. 


True-false items


The true-false item consists of a statement which is to be judged
either true or false by the student.

Example: There are two pints in a quart. T F 
The student responds to a true-false item by choosing either the 
T if he believes the statement is true or the F if he believes the 
statement is false. 

Sometimes the difficulty of the true-false items is increased by 
requiring the pupil to make every false statement true by replacing 
a key word in the sentence. 

Example: There are three pints in a quart. T F ______
If the pupil marks the statement false, he is expected to write a 
word in the blank which will make the statement true. In the 
above example, the pupil would be expected to circle the F and 
write the word “two” in the blank. 

True-false statements have had wide use in teacher-made tests. 
The fact that they can be written easily has led some teachers to 
construct them carelessly and use them excessively. Teachers often 
construct true-false tests by "lifting" a number of true statements 
directly from the book, making some of the statements false by 
changing a word or by inserting a negative at a convenient spot. 
Unfortunately, tests constructed in this manner are usually poor 
tests, in that they encourage rote memorization of text material, a 
goal not usually endorsed by the teacher. 

The true-false item should only be used when a simple state- 
ment is either completely true or completely false. Since only a 
small percentage of important items in most areas of learning meet 
this criterion, the number of true-false items which can be used 
in a test is limited. Determining if a statement is 100-percent true 
is sometimes difficult. The pupil faced with the problem of deciding
whether the following statement is correct is in a quandary.

If three boys divide a dollar among them, each
will have 33-1/3 cents. T F

Is it true or is it false? Since $1.00 divided by 3 is 33-1/3 cents, the
item is true, but since there is no such thing as 1/3 cent in our
monetary system, the boys cannot have 33-1/3 cents. Therefore,
the statement is false. Pupils should not be forced to guess which 
of these interpretations the teacher wishes them to make. An item 
which is partly true and partly false should not be included in a 
true-false test. 

Although the true-false item has received its greatest use in test- 
ing memory for simple facts, it is possible to utilize this item form 
in testing more complex reasoning processes. Certainly the truth 
or falsity of the following item is not determined by recourse to 
rote memorization. 

A box 6" x 8" x 12" contains the same number of
cubic inches as a box 3" x 16" x 12". T F

The true-false item is one which most teachers will find useful
if they recognize its limitations and use it sparingly with the following
safeguards:


1. Use the true-false item form only for items which are either 
100-percent true or 100-percent false. 


2. Avoid writing true-false items in which the statements are
lifted verbatim from the text.


Direct question and completion items 


In contrast to the selection-type items just discussed, supply-type 
items require the student to supply the answer in his own words. 

The two major forms of the supply-type item are the direct 
question and the completion item. The direct question is actually 
a form of the essay item, but it is usually restricted to questions 
which can be answered in a word, a sentence, or a number. The 
following example illustrates this type of item: 

Who discovered America? ______
This same item can also be written as a completion item as follows:

America was discovered by ______.
However, in this form the item is ambiguous since the answer "accident"
is just as correct as "Columbus," although the item-writer
probably did not have "accident" in mind as a possible correct
answer.


Supply-type items often have more than one correct answer. 
Many words have synonyms and near-synonyms, and it is often 
difficult to determine just when an item has been answered cor- 
rectly. Even mathematical problems sometimes pose a problem for 
the scorer. For example, if the correct answer to a problem is 5-1/3, 
is 5.33 right, or is it wrong? 
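One way to settle such questions is to decide, before the test is scored, what degree of precision will be accepted. The sketch below is only an illustration of that idea: it accepts an answer if it equals the keyed value exactly or equals the keyed value rounded to a stated number of decimal places. The function name and the two-place default are assumptions, and answers are assumed to be written as decimals or simple fractions such as "16/3".

```python
from fractions import Fraction

KEY = Fraction(16, 3)  # the keyed answer 5-1/3, written as an improper fraction


def is_acceptable(answer, key=KEY, places=2):
    """Accept an exact answer, or the key rounded to `places` decimals."""
    given = Fraction(answer)  # parses strings such as "5.33" or "16/3"
    return given == key or given == round(key, places)


for answer in ("16/3", "5.33", "5.3", "5.34"):
    print(answer, is_acceptable(answer))
```

If the directions to the pupils state the precision expected, a rule of this kind can be applied uniformly and the scorer no longer has to judge each borderline answer separately.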

The advantages of the supply-type of item are that they are com- 
pletely free of the effects of guessing and that they motivate pupils 
to learn material to the point where it can be recalled. The major 
disadvantage is the difficulty of determining which answers shall be 
accepted as correct. 

Supply-type items are particularly useful in mathematics
or science, where the results of complex reasoning processes can be 
represented by a few symbols or numbers. 


Suggestions for writing supply-type questions 


1. When possible, use a direct question rather than the comple- 
tion form of item. 


2. Use only questions which can be answered by a unique word, 
phrase, number, or symbol. 


3. Avoid using statements lifted directly out of the book, since 
this tends to overemphasize rote learning. 


4. In computational problems, specify the units in which the 
answer is given and also the degree of precision expected. 


5. Avoid using completion items with too many words omitted. 


General suggestions for writing short-answer test items 


The following suggestions apply to writing all types of short- 
answer test items. 

1. As ideas for test items occur, make a note of them. During the
day-to-day teaching activities, ideas for good test items will occur to 
the teacher. Unless notes are made while the idea is fresh in mind, 
the chances are that it will not be remembered when the time comes 
to construct a test. 


2. After a group of test items has been written, some other 
teacher or other person who knows the material well should look 
them over and try to answer them. If another teacher does not 
agree with the answer to a question, there may be something wrong 
with the item. 


3. If someone else cannot be found to criticize the items, they 
may be put aside for a few days, then read again. Often this procedure
will reveal ambiguous items which were not apparent at
first glance.


4. The reading difficulty of the test items should be kept as low
as possible, unless the test is to be used to measure the student's
reading ability.


Assembling, administering, and scoring the objective test 


A test is a collection of items. Tests can be composed of items 
all of the same type (e.g., true-false tests, multiple-choice tests), or
they can consist of a variety of item types. Including a variety of 
item types is usually preferable in a lengthy test, since it provides 
more flexibility in covering the material. 

If a variety is used, each type should be grouped into a separate 
section of the test. Thus a test might consist of a group of true- 
false items, plus a group of multiple-choice items, plus a group of 
direct-question items. Teachers frequently include both objective 
and essay items on the same tests. If a group of items contains some 
items which are more difficult than the others, it is preferable to 
place these items at the end of the group. 

In assembling the items, care should be taken that the occurrence 
of correct responses follow a random pattern. Avoid a regular pattern 
of correct answers. Also avoid having any particular response 
position as the correct answer more frequently than any other 
response position. Thus in a set of four-choice multiple-choice 
items choices 1, 2, 3, and 4 should each be the correct answer 
approximately one fourth of the time. In a set of true-false items 
there should be approximately the same number true as there 
are false. 

Objective tests are usually dittoed or mimeographed so that each 
pupil has a copy of all of the items. Pupils may indicate their 
answers by marking directly on the test or by marking on a separate 
answer sheet. Separate answer sheets should be used only with 
mature students who will not be confused by their use, usually 
junior high school and senior high school students. If answers are 
to be recorded on the test paper, the scoring may be facilitated by 
providing spaces for answers to the items in a column down one side 
of the test. A scoring key can then be laid beside the column, and 
the right and wrong answers easily determined. 

The assembled test should include directions to the pupils. These 
directions should specify the method to be employed in responding 
to the items (e.g., circle the correct answer; cross out the letter that 
corresponds to the correct answer). The pupil should also be in- 
structed whether to "guess" or not when he is not sure of an answer. 
Formulas have been developed which are designed to correct for 
"guessing." The use of these formulas is neither necessary nor advis- 
able when sufficient time has been allowed for almost all pupils to 
attempt all of the items, and when pupils have been instructed to 
respond to all items, even if they are not positive of the answer. For 
short-answer teacher-made tests it is recommended: (1) that suffi- 
cient time be allowed for all pupils to attempt all of the items, and 
(2) that pupils be instructed to answer all items on the test. 
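
For readers curious about the formulas mentioned above, the most commonly cited correction subtracts a fraction of the wrong answers from the number right: S = R - W/(k - 1), where R is the number right, W the number wrong (omitted items are not counted), and k the number of choices per item. The text does not say which formula it has in mind, so the following is only a sketch under that assumption; the function name and the sample figures are hypothetical.

```python
def corrected_score(right: int, wrong: int, choices: int) -> float:
    """Commonly cited correction for guessing: R - W / (k - 1).

    Assumptions: every item has the same number of choices (k >= 2),
    and omitted items count as neither right nor wrong.
    """
    if choices < 2:
        raise ValueError("each item needs at least two choices")
    return right - wrong / (choices - 1)


# Hypothetical pupil: 40 four-choice items, 30 right, 9 wrong, 1 omitted.
print(corrected_score(30, 9, 4))   # 27.0
# On a true-false test (k = 2) the formula reduces to rights minus wrongs.
print(corrected_score(30, 10, 2))  # 20.0
```

When every pupil answers every item, as recommended above, the number wrong is simply the number of items minus the number right, so the corrected scores rank the pupils in exactly the same order as the raw scores. This is one way to see why the correction adds nothing in that situation.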

The importance of a carefully prepared scoring key is sometimes 
overlooked. This key should be checked and rechecked to be cer- 
tain that it contains no errors. The actual scoring process consists of 
comparing the pupils' responses with the answer key and indicating 
which are right and which are wrong. The most widely used 
method of scoring is to give one point for each short-answer ques- 
tion answered correctly. Scoring methods that give various weights 
to different items have not proved useful. The total number of 
items right is the student's "raw score" on the test. The importance 
of accuracy in scoring tests is obvious. Provisions should be made 
for some means of checking scoring, especially on important tests. 
This check could take the form of a rescoring by the teacher, a 
second scoring by an assistant, or reviewing the answers to the test 
during a class period with pupils checking their own responses. 
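
The scoring rule described above (one point for each item answered correctly, with the total taken as the raw score) can be sketched in Python as follows. This is an illustration only; the key, the pupils, and the function name are hypothetical.

```python
def raw_score(responses: list[str], key: list[str]) -> int:
    """Give one point for each response that matches the scoring key."""
    if len(responses) != len(key):
        raise ValueError("answer sheet and key must have the same length")
    return sum(1 for given, correct in zip(responses, key) if given == correct)


# Hypothetical five-item key and two pupils' answer sheets.
key = ["T", "F", "3", "1", "T"]
sheets = {
    "Pupil A": ["T", "F", "3", "2", "T"],
    "Pupil B": ["F", "F", "3", "1", ""],  # "" marks an omitted item
}

scores = {name: raw_score(answers, key) for name, answers in sheets.items()}
print(scores)  # {'Pupil A': 4, 'Pupil B': 3}
```

Because every item carries the same single point, an error in the key changes every pupil's score on that item, which is why the key itself deserves the checking and rechecking recommended above.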


Analyzing the results of an objective test 


With experienced teachers the testing process does not stop with 
scoring the test and recording the grade; instead the test results 
serve as guides to further teaching and also as means of improving 
the quality of future tests. 

By examining the responses of the class to individual items in the 
test, the teacher can discover items which were missed by many 
members of the class. Investigation can then reveal why the item 
was so difficult, and if some fact, principle, generalization, or other 
objective of instruction has not been learned, the teacher can 
"reteach" the objective. By having pupils explain why they chose 
the incorrect answer that they did, the teacher often gets insight 
into the nature of the pupils’ difficulties. 

In addition to discovering items which are difficult for the class 
as a whole, the teacher can use an individual's test paper diagnos- 
tically by finding out what difficulties the individual is having, and 
then working with him to overcome these difficulties. 

Teachers can also use the results of testing to discover important 
information about the quality of their test items. If the test items 
are discussed with the class after the test has been given, items 
which are ambiguous, which have no right answers, or which have 
more than one right answer, may be discovered. The fair teacher 
will not penalize pupils for poor items and will discard obviously 
poor items from the test. 

One method of discussing the test with the class before scoring 
the test is to have pupils mark their answers on both the test itself 
and on a separate answer sheet. The answer sheets are collected 
after the test, while the pupils keep their copies of the test in front 
of them. Then, before the tests are scored, the test items are dis- 
cussed with the class, and pupils are given their chance to comment 
on any of the items on the test which they do not understand. If, 
during this discussion period, any items which are ambiguous or 
otherwise poor are discovered, these items can be omitted from the 
scoring key when the separate answer sheets are scored. Naturally, 
the poor items will be revised before they are used in future tests. 

The clerical work involved in determining how many pupils 
chose each response to each item in an objective test can be reduced 
if this activity is made a part of the discussion of the test. For 
example, when discussing a particular multiple-choice item the 
question may be asked “How many chose the first answer?” “How 
many chose the second answer?” and so on. Before using the items 
in another test it may be desirable to make revisions, replacing 
responses which were not attracting any of the pupils who did not 
know the right answer. By keeping and referring to a file of old 
tests or test items, together with a record of how difficult the items 
were, the teacher can continually improve her own tests. 
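
The tallying described above (how many pupils chose each response) and the record of item difficulty can be kept in a few lines of Python. This is a minimal sketch, not a standard procedure from the text; the function name, the sample responses, and the use of "proportion right" as the difficulty record are assumptions made for illustration.

```python
from collections import Counter


def analyze_item(responses: list[str], correct: str, options: list[str]) -> dict:
    """Summarize the class's responses to one multiple-choice item."""
    if not responses:
        raise ValueError("no responses to analyze")
    counts = Counter(responses)
    tally = {option: counts.get(option, 0) for option in options}
    return {
        "tally": tally,
        # Share of the class answering correctly; a low value marks a hard item.
        "proportion_right": tally[correct] / len(responses),
        # Wrong choices that attracted nobody are candidates for replacement.
        "unused_distractors": [
            o for o in options if o != correct and tally[o] == 0
        ],
    }


# Hypothetical item: ten pupils, four choices, choice "2" keyed as correct.
responses = ["2", "2", "1", "2", "4", "2", "1", "2", "2", "4"]
summary = analyze_item(responses, correct="2", options=["1", "2", "3", "4"])
print(summary["tally"])               # {'1': 2, '2': 6, '3': 0, '4': 2}
print(summary["proportion_right"])    # 0.6
print(summary["unused_distractors"])  # ['3']
```

Filed with the item, a summary of this kind shows at a glance which items were hard and which wrong choices need rewriting before the item is used again.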


EXERCISES 


1. Have a committee chosen from your class devise a 15-minute 
short-answer test covering this chapter. Let the remainder of the 
class take the test. After the class has taken the test, have a class 
discussion about the quality of each item in the test. 

2. Obtain a teacher-made short-answer test in your field and examine 
it to see how it might be improved. When you find items in 
the test which you think are poor, rewrite them to see if you can 
improve them. 

3. After you read the next chapter compare the short-answer test 
with the essay test. What advantages does the short-answer test 
have over the essay test? What advantages does the essay test 
have over the short-answer test? 

4. Prepare ten multiple-choice test items on vocabulary, for a stated 
grade and subject. 


SUGGESTED ADDITIONAL READINGS 


Adkins, D.C. Construction and Analysis of Achievement Tests. 

Gerberich, J.R. Specimen Objective Test Items. 

Henry, N.B. (Editor). Measurement of Understanding. 

Lindquist, E.F. (Editor). Educational Measurement. 

Odell, C.W. How to Improve Classroom Testing. 

Remmers, H.H., and Gage, N.L. Educational Measurement and 
Evaluation. 

Ross, C.C., and Stanley, J.C. Measurement in Today’s Schools. 

Thorndike, R.L., and Hagen, E. Measurement and Evaluation in 
Psychology and Education. 

Travers, R.M.W. How to Make Achievement Tests. 


CHAPTER 3 


Evaluating Achievement with 
Teacher-devised Essay Tests 


question item used in short-answer tests can be scored either right 
or wrong, whereas the essay item permits answers which vary in 
their degree of rightness. The question, “Who is the president of 
the United States?” has only one correct answer and can therefore 
be considered a form of objective item. The essay item, “Describe 
and explain the duties and powers of the president of the United 
States of America,” permits many different answers which vary 
greatly in the extent to which they are considered correct by the 
person scoring the examination. The second difference between the 
direct-question short-answer item and the essay item is found in 
the length of the answer. Direct questions used as objective items 
usually can be answered in a word or a phrase. Essay items generally 
require considerably longer answers. 


Purposes of essay tests 


Some courses such as English or journalism include among their 
objectives the ability to organize and present material in written 
form. The most direct and obvious way to measure achievement in 
this area is by means of the essay test or the assigned paper. Since 
papers written outside of the classroom are sometimes produced by 
persons other than those who submit them, the essay test becomes 
the best measure of how well a pupil can handle the English 
language. Essay tests used in English composition classes can 
reveal the ability of the pupil to express himself in an organized 
fashion. They can also be used to obtain evidence regarding the 
pupil’s achievement in grammar, spelling, and handwriting. 

The essay test is also useful in measuring the complex objectives 
of instruction in other courses, such as social studies. Although 
short-answer items can be written to measure complex mental proc- 
esses, they are difficult to construct. For this reason, many teachers 
prefer to utilize essay items to measure their pupils' ability to 
organize and critically evaluate facts and ideas drawn from broad 
and complicated bodies of subject matter. 


Advantages of the essay test 


The greatest advantage of the essay test is its suitability in meas- 
uring those complex learnings which cannot easily be measured by 
means of short-answer tests. It also has the advantage of encourag- 
ing pupils to study and learn material in large and interrelated 
units rather than as fragmentary and isolated facts. Another advan- 
tage of the essay test is the relative speed with which essay tests can 
be written as compared with short-answer tests. 


Disadvantages of the essay test 


The major criticism of the essay test is that the reliability of 
scoring the test is usually low compared with that of short-answer 
tests. Many studies confirm that two different persons scoring a set 
of essay tests may differ widely on scores assigned to the papers. 
There are, however, methods of scoring which lead to increased 
agreement between scorers so that this disadvantage need not neces- 
sarily be a serious one. 

The essay test has also been criticized on the grounds that it 
leads to inadequate sampling of material studied. This criticism is 
based on the fact that essay tests sometimes consist of a very small 
number of items. If a pupil does not happen to be well prepared 
on one of the items, it may lower his score greatly, even though he 
may be well prepared in a majority of the areas which the test is 
supposed to cover. The pitfall of “limited sampling” can be 
avoided by using a larger number of items, each of which requires 
brief essay responses (a paragraph or two), in preference to a few 
items each requiring long and involved answers. 


Since essay tests are very time-consuming to score, adequate time 
must be budgeted for reading the papers. 


Constructing the essay test 


Some goals can best be evaluated by means of short-answer items 
and others by means of essay items. If a goal can be adequately 
evaluated by means of short-answer items, they should be used in 
preference to essay items. Reserve the essay item to evaluate those 
goals which cannot be easily or adequately measured by short-answer 
items. Questions of a factual nature which require the pupil 
to answer "who," "what," "when," or "where" should be tested by 
use of short-answer items. Essay items are usually reserved for the 
measurement of more complex learnings. Typical of essay items are 
those which require the pupil to "explain," "compare," "contrast," 
"interpret," "show differences," or "summarize." There are many 
more “key words” which are characteristic of essay items. What they 
all have in common is the fact that they require the pupil to demonstrate 
his understanding of what he has learned. 


After determining which goals require the use of essay items, 
make a brief outline of the content to be covered by the items. Then 
determine the relative importance of the various parts of the content, 
and decide which aspects of the content to include in the 
test items. Finally, write the items. Usually a better test will result 
from the use of a relatively large number of short essay items, 
rather than just a few long items. 


Items should usually be specific enough so that the pupils will not 
have to guess the nature of the expected answer. The item “Discuss 
the United Nations” is so broad that the only possible justification 
for its use would be to see what aspects of the United Nations 
seemed to be important to the individual pupils. Many more 
specific questions could be written to elicit pupils’ knowledge about 
certain aspects of the United Nations. For example, “Discuss the 
purposes for organizing the United Nations.” “On what grounds has 
the United States of America opposed the admission of Communist 
China into the United Nations?” and so on. 

In general, the essay items should be as specific as possible, as 
long as the specificity does not defeat the intended purpose of the 
item. Every effort should be made to make the item clear and 
unambiguous, so that each pupil will have the same understanding 
of what kind of an answer is expected. As with short-answer items, 
essay items can profit from previous trial on persons who know 
something about the material being tested. 

The practice of allowing pupils to select the items they wish 
to answer from a longer list of items is open to serious question. 
When this practice is followed, it becomes impossible to compare 
the performances of the various pupils with each other. Thus, one 
of the usual purposes of an examination, comparison of pupil 
achievement, is defeated. All students should be required to run 
the same race (answer the same questions) when essay items are 
used to measure common learnings. 


Scoring the essay test 


There are two basically different methods of scoring essay items— 
the analytical method and the sorting method. 

The analytical method consists of constructing a model answer, 
analyzing this answer into a number of separate elements, and as- 
signing some arbitrary number of points to each of the elements. 
After the model answer is constructed, each pupil's answer is com- 
pared with this answer and assigned points according to whether the 
answer contains the elements included in the model answer. 
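
The bookkeeping of the analytical method can be sketched in Python. The elements and point values here are hypothetical, invented only to illustrate the essay item on the duties and powers of the president mentioned earlier; deciding whether a pupil's answer actually contains an element remains the reader's judgment and is entered by hand.

```python
# Hypothetical model answer, analyzed into elements with arbitrary points.
model_answer = {
    "names the executive role": 2,
    "explains the veto power": 3,
    "describes commander-in-chief duties": 3,
    "gives an example of an appointment power": 2,
}


def analytical_score(elements_found: set[str], model: dict[str, int]) -> int:
    """Sum the points for the model-answer elements judged to be present."""
    unknown = elements_found - set(model)
    if unknown:
        raise ValueError(f"not in the model answer: {sorted(unknown)}")
    return sum(points for element, points in model.items()
               if element in elements_found)


# Elements the reader found in one pupil's answer (recorded by hand).
found = {"names the executive role", "explains the veto power"}
print(analytical_score(found, model_answer))  # 5
print(sum(model_answer.values()))             # 10 points possible
```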

The sorting method consists of reading the answer to each ques- 
tion as a whole, without detailed analysis of the points or elements 
which it contains, and deciding on its over-all quality. After the 
question is read, the paper is placed into one of five piles labeled, 
"Superior," "Above average,” "Average," “Below average,” and 
“Poor.” This process is continued until all of the papers have been 
sorted into one of the five piles. Then the papers in each pile are 
reread to be sure that they have been properly allocated, and any 
changes which seem indicated are made. Finally, a numerical score 
is assigned to each of the piles, and each paper in that pile receives 
that score. The number 5 can be assigned to the “Superior” 
answers, 4 to the “Above average” answers, and so on with “Poor” 
receiving a score of 1. The score of 0 can be reserved for complete 
failure to answer the question. This process is repeated for each 
of the questions on the test, and the total score on the test consists 
of the sum of the numbers assigned to the individual items. 
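
The arithmetic of the sorting method is simple enough to sketch in Python. The pile labels and their scores follow the description above; the label "No answer" for the score of 0, the function name, and the sample pupil are assumptions made for illustration.

```python
PILE_SCORES = {
    "Superior": 5,
    "Above average": 4,
    "Average": 3,
    "Below average": 2,
    "Poor": 1,
    "No answer": 0,  # reserved for complete failure to answer
}


def total_essay_score(piles_by_question: list[str]) -> int:
    """Add up the pile scores a pupil received, one pile per question."""
    return sum(PILE_SCORES[pile] for pile in piles_by_question)


# Hypothetical pupil on a four-question essay test.
piles = ["Above average", "Average", "Superior", "No answer"]
print(total_essay_score(piles))  # 12
```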

In using either the analytical method or the sorting method 
there are several suggestions which will improve the reliability of 
scoring. 


1. Before scoring the papers, devise a model answer. Decide how 
much weight should be placed on content, and how much on organization. 
Unless handwriting, spelling, grammatical usage, etc., are 
included in the objectives to be evaluated by a test, ignore these 
aspects of the answer in reading the papers. If any of these aspects 
are to be included in the over-all score, the paper should be read 
separately for spelling, grammar, etc., and a separate score assigned 
for performance in these areas. This score can be added into the 
total score if desired. 


2. Read the questions anonymously—do not look at the name of 
the person who wrote the test until after the papers are scored. 
Knowledge of who wrote the answer can unfairly influence the test 
scorer. Anonymity can be secured by having students write their 
names on the backs of the examination papers where they will not 
be seen. 


3. Score only one question at a time. If the test consists of more 
than one item, as it usually will, first score item one on all papers. 
Then score item two on all papers, and so forth, until all of the 
items have been scored. 


EXERCISES 


1. Can you think of any possible use for a question like, “Discuss 
Shakespeare”? Prepare two or three short essay questions which 
might be used to measure an eleventh-grade student’s knowledge 
of Shakespeare and his works. 


2. Draw a parallel between the items on an essay test and the 
separate events in the decathlon. Are the athletes permitted 
any choice in the events to be included in the decathlon? What 
would happen if each athlete were permitted to participate in 
any ten events of his own choice? What would happen if each 
student were allowed to select the test questions he wants to 
answer? 


3. Define an objective in your field which can be tested by means 
of an essay item. Devise an essay item to measure this objective. 
Write a model answer for the item. 


4. List as many "key words" as you can which characterize essay 
items (e.g., "explain," "compare," "contrast"). 


5. What is meant by the term "limited sampling," and how is it 
related to essay tests? 


SUGGESTED ADDITIONAL READINGS 


Lindquist, E.F. (Editor). Educational Measurement. 

Remmers, H.H., and Gage, N.L. Educational Measurement and 
Evaluation. 

Ross, C.C., and Stanley, J.C. Measurement in Today's Schools. 


CHAPTER 4 


Evaluating Achievement Through 
Products and Performances 


Although many objectives of instruction can be evaluated by the 
use of paper-and-pencil tests, many other objectives cannot be 
evaluated in this manner. Paper-and-pencil tests can be used to 
determine whether a student knows the rules of baseball, the cor- 
rect temperature to bake a cake, or the approved method of joining 
two pieces of wood. However, the fact that the pupil can give the 
correct answer to questions about baseball, baking, or woodwork- 
ing is no guarantee that he can actually play baseball, bake a cake, 
or make a bookcase. Since many objectives of instruction actually 
call for the pupil to be able to do something, rather than just 
answer questions about doing it, it becomes necessary to have 
methods to evaluate the “doing.” 


Frequently the act of doing something produces an end product, 
such as a cake or a bookcase, which can be evaluated after it has 
been completed. At other times there is no end product involved 
in the pupil's activity, and the teacher can only evaluate the act 
while it is being performed. Such activities as giving an oral 
report or performing on a musical instrument produce no permanent 
product and must be evaluated while they are being performed. 
In some cases the teacher has the choice of evaluating 
either the end product or the performance. For example, the pupil 
baking a cake could be evaluated on the quality of the cake after 
it had been baked or on the procedures which were followed in 
making the cake (using correct ingredients, measuring ingredients 
correctly, mixing ingredients properly, using correct baking tem- 
perature, etc.). At times the teacher may want to evaluate both the 
end product, when one exists, and the procedures used in arriving 
at that end product. 

The basic difference between evaluating a product and evaluating 
a performance is that the product can be evaluated at the teacher's 
convenience and can be examined at length. Performances, on the 
other hand, must be evaluated "on the run," and the teacher does 
not have a second chance to correct the evaluation. 

As with all evaluation, the first steps in evaluating a product 
or a performance consist of formulating the objectives, translating 
the objectives into specific measurable pupil behaviors, and then 
providing situations in which these behaviors can occur and can be 
observed. At this point the teacher must have some device which 
can be used to record and measure the behavior. Unless the teacher 
has carefully analyzed the behaviors which are expected from the 
student, she can make only a crude over-all evaluation of the prod- 
uct or performance. This evaluation may not be based on a con- 
sideration of the attainment of the important objectives of the 
assignment. 

Since the general method of evaluating products or performances 
is essentially similar regardless of the particular product or procedure 
to be evaluated, the evaluation of a written report will be 
used to illustrate the method. Assume that a teacher has given a 
tenth-grade social studies class the assignment: “Write a report on 
one of the South American countries, covering briefly the popu- 
lation, economy, geography, and political organization of the 
country.” After the class has completed writing their reports and 
has turned them in, the teacher is faced with the problem of 
evaluating the reports. One method of evaluating the reports 
would be to scan them one by one and assign a letter grade to 
each. Under this method of evaluation papers which contain very 
little factual material but which are neatly typed often receive 
higher grades than those which have more factual material but are 
handwritten. In similar fashion, a pupil who has done a good deal 
of research may have his grade severely lowered because of spelling 
mistakes even though spelling may not be a major objective of the 
particular assignment being evaluated. In order to keep in mind 
just what is being evaluated in a particular assignment, some kind 
of a list of objectives is needed to guide the evaluation. In its 
simplest form this list is called a check list and consists of various 
characteristics of the assignment which are marked as being absent 
or present. The following check list illustrates the kinds of charac- 
teristics which might be included in evaluating a paper such as 
that assigned in the illustration. 


Specimen Check List for Evaluating a Written Report 

                                        Yes     No

1. Well-organized                       ____    ____
2. Based on research                    ____    ____
3. Covers topic adequately              ____    ____
4. Stays within assigned topic          ____    ____
5. Free of spelling errors              ____    ____
6. Free of grammatical errors           ____    ____
7. Neat writing or typing               ____    ____
(plus other items) 


The items to be included in a list of this type will naturally differ 
from teacher to teacher and from assignment to assignment. The 
check list is the simplest way of recording the presence or absence 
of certain qualities of a product or procedure being evaluated. 
Sometimes the teacher will not be willing to use a simple check 
list, especially when some of the items to be evaluated exist in 
varying quantities rather than merely being present or absent. By 
providing more than two options for each item in the list, the 
teacher is better able to reflect the varying degrees of “goodness” 
of the characteristics being evaluated. Instead of using the check 
list illustrated above, the teacher might want to allow for more 
refined evaluation of the characteristics of the paper by rating each 
characteristic on a five-point scale where 5 means outstanding, 3 
means average, and 1 means unsatisfactory. With this variation the 
device is normally called a rating scale and would appear as 
follows: 

                              Unsatisfactory   Average     Outstanding
1. Organization                     1      2      3      4      5
2. Research                         1      2      3      4      5
3. Coverage                         1      2      3      4      5
4. Stays within assigned topic      1      2      3      4      5
5. Freedom from spelling errors     1      2      3      4      5
6. Freedom from grammatical errors  1      2      3      4      5
7. Quality of writing or typing     1      2      3      4      5
(plus other items) 


By circling the appropriate number for each characteristic to be 
rated, the teacher expresses her judgment of the quality of that 
particular characteristic. The over-all rating for the paper may be 
obtained by adding the points for the separate characteristics to 
obtain a total score for the paper as a whole. 

In both the check list and rating scale illustrated above, no effort 
has been made to weight the various characteristics evaluated. 
Usually some of the characteristics of the assignment being evalu- 
ated will be considered to be more important than other charac- 
teristics. If this be the case, the rating scale just considered can be 
modified to allow for this fact. Suppose that in the previous 
example the first four characteristics were considered more impor- 
tant than the last three characteristics. This difference in impor- 
tance could be reflected in assigning more possible points to each of 
the first four characteristics and fewer to the remaining three. Such 
a rating scale might look like this: 


                              Unsatisfactory   Average     Outstanding
1. Organization                     1      2      3      4      5
2. Research                         1      2      3      4      5
3. Coverage                         1      2      3      4      5
4. Stays within topic               1      2      3      4      5
5. Freedom from spelling errors     1             2             3
6. Freedom from grammatical errors  1             2             3
7. Quality of writing or typing     1             2             3
(plus other items) 


By varying the number of points which can be earned for each 
characteristic of the assignment evaluated, weighting can be 
achieved to conform to the teacher's ideas on the proper amount of 
importance to be given to each characteristic. 

Most teachers will add the points awarded to each of the separate 
characteristics to obtain a total number of points for the assignment. 
Then they convert the total points into a letter grade which 
can be recorded in a record book and which can be incorporated 
into an over-all grade for the semester. Even though this be done, 
the detailed evaluation of the product or procedure provided by the 
check list or rating scales is more informative to the student than 
the over-all grade and should be given to the pupil so that he may 
see what his strong points and weak points were on the assignment. 
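
The weighting and grade-conversion steps described above can be sketched in Python. This is only an illustration: the maximum points follow the weighted scale (five for each of the first four characteristics, three for each of the last three), while the sample ratings, the function names, and especially the letter-grade cut-offs are hypothetical, since the text prescribes none.

```python
# Maximum points per characteristic on the weighted rating scale.
MAX_POINTS = {
    "Organization": 5,
    "Research": 5,
    "Coverage": 5,
    "Stays within topic": 5,
    "Freedom from spelling errors": 3,
    "Freedom from grammatical errors": 3,
    "Quality of writing or typing": 3,
}

# Hypothetical cut-offs, as fractions of the possible points.
CUTOFFS = [(0.90, "A"), (0.80, "B"), (0.70, "C"), (0.60, "D")]


def total_rating(ratings: dict[str, int]) -> int:
    """Check each rating against its scale and return the total points."""
    if set(ratings) != set(MAX_POINTS):
        raise ValueError("rate every characteristic exactly once")
    for name, value in ratings.items():
        if not 1 <= value <= MAX_POINTS[name]:
            raise ValueError(f"{name}: rating {value} is off the scale")
    return sum(ratings.values())


def letter_grade(points: int, possible: int) -> str:
    """Convert total points to a letter using the hypothetical cut-offs."""
    share = points / possible
    for minimum, grade in CUTOFFS:
        if share >= minimum:
            return grade
    return "F"


# Hypothetical ratings for one pupil's report.
ratings = {
    "Organization": 4,
    "Research": 5,
    "Coverage": 3,
    "Stays within topic": 5,
    "Freedom from spelling errors": 2,
    "Freedom from grammatical errors": 3,
    "Quality of writing or typing": 1,
}
points = total_rating(ratings)
possible = sum(MAX_POINTS.values())
print(points, "of", possible)          # 23 of 29
print(letter_grade(points, possible))  # C

# The detailed ratings, not only the letter, go back to the pupil.
for name, value in ratings.items():
    print(f"{name}: {value}/{MAX_POINTS[name]}")
```

Keeping the separate ratings alongside the letter grade preserves the information about strong and weak points that the single grade conceals.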

Another example of a rating scale is the following one used to 
evaluate an original geometric design drawn with a straight edge 
and compass. Similar scales could be devised for use in assessing a 
painting or ceramic piece in art, a table in a woodworking class, or 
a skirt pattern in homemaking. 


Rating Scale for Original Design in Geometry 


The design: 
1. Shows originality                          1    2    3    4    5
2. Exhibits the stipulated directions as      1    2    3    4    5
   to size, type of paper, etc.
3. Exhibits skillful use of tools             1    2    3    4    5
4. Utilizes principles of geometric           1    2    3    4    5
   construction
5. Exhibits accepted characteristics of       1    2    3    4    5
   good design
6. Demonstrates the expenditure of effort     1    2    3    4    5


1 = The design is poor in this respect. 
2 = The design is below average in this respect. 
3 = The design is average in this respect. 
4 = The design is above average in this respect. 
5 = The design is superior in this respect. 


The following suggestions should prove useful in devising and 
using rating scales or check lists for evaluating products and 
procedures. 


1. Include only observable, measurable characteristics in the list. 

2. Keep the number of characteristics as small as possible while 
still covering the important characteristics to be evaluated. 

3. If some characteristics are more important than others, provide 
for this by assigning more "possible points" to the more 
important ones. 

4. In using the check list or rating scales, beware of the “halo” 
error. This error is made when the teacher assigns points on 
the separate characteristics on the basis of some over-all impression 
of the product or procedure, or even of the student 
who produced it. Thus, an average product produced by a 
pupil whom the teacher regards as being a good student 
might receive a higher score than a good product produced 
by a pupil whom the teacher thinks of as being an average 
student. 


EXERCISES 


1. List five products which might be developed or constructed by 
pupils in your field. 

2. List five performances in your field which relate to achievement 
and which require evaluation. 

3. Develop a check list for evaluating a product or a performance 
in your teaching area. 

4. Develop a rating scale for evaluating a product or a performance 
in your teaching area. 

5. Obtain a check list or rating scale presently being used at some 
level of education. Examine the check list or rating scale to see 
if you can suggest any improvements. 

6. As a class project, develop a check list or rating scale to evaluate 
a product or performance assigned in the class in which you are 
studying this text. 


SUGGESTED ADDITIONAL READINGS 


Micheels, W.J., and Karnes, M.R. Measuring Educational Achievement. 

Remmers, H.H., and Gage, N.L. Educational Measurement and 
Evaluation. 

Thomas, R.M. Judging Student Progress. 


CHAPTER 5 


Evaluating Typical Behavior with 
Teacher-devised Instruments 


Meaning of typical behavior 


In the preceding chapters methods of measuring pupil achieve- 
ment have been discussed. These methods are used when the 
teacher wishes to determine what the student can do when he is 
trying to do his best. It is assumed that pupils will try to get the 
best scores they can on tests or on assigned papers. In addition to 
these measures of “best” behavior, the school is often interested in 
evaluating the pupils’ customary or typical behavior. The differ- 
ence between "best" behavior and typical behavior can be illus- 
trated by considering the problem of evaluating a boy's ability to 
drive a car. If he is given an examination for the purpose of 
granting him a driver's license, there is little doubt that he will try 
his best to obey all of the rules and to handle the car with care. 
This is an example of test behavior. Even though the boy passes the 
driving test with a high score, there is no assurance that he will 
customarily or typically drive with the same degree of skill or 
caution. 

The specific typical behaviors which the teacher will be interested 
in evaluating will be determined by the specific objectives of the 
school and of the teacher. Generally, the behaviors evaluated will 
fall into the areas of personal and social behavior. Such characteristics 
as work habits, ability to work with others, and citizenship are 
among those typical behaviors which are evaluated in many class- 
rooms today. 


Evaluating typical behavior through observation 


The principal method utilized in evaluating typical behavior is 
observation. Teachers spend much of their time observing their 
students. If the teacher knows what she is seeking as she observes
her pupils, she can obtain much valuable information about them. 
Observations may be either planned or informal. When the teacher 
is consciously observing selected pupils for specified behaviors, the 
observations are planned. Informal observation takes place when 
the observer notices, and eventually records, behaviors which arise 
unexpectedly. Whether planned or informal, observations should 
be recorded. A check list or rating scale may be utilized to record 
formal observations. For recording observations about the work 
habits of a pupil a check list like this might be utilized: 


                                                      Yes    No

Begins work without delay                             ___    ___
Has necessary books and supplies                      ___    ___
Continues work without unnecessary interruptions      ___    ___
Asks questions if he does not understand assignment   ___    ___


Another method of recording the observation would be a rating
scale. One form of scale for rating work habits would be:


1              2              3              4              5

Does not do assigned work            Starts to work immediately
Does not have necessary books        Has necessary books and
  or supplies                          supplies
Interrupts other pupils              Continues to work steadily
Does not ask questions if            Asks questions if necessary
  necessary


To use this scale, the teacher would make a mark on the scale to
indicate the work habits of the pupil. Five would indicate the
best work habits and one the poorest.

In the rating scale above, only the two extremes of behavior are
defined. Additional refinement can be obtained by similarly defining
the three gradations between the extremes.

A slightly different approach to recording typical behavior, which
is used in special instances, is that referred to as time sampling. To
study a particular pupil, an observer may observe his behavior during
a specified period of time. A report of this kind might read
as follows:


10:30 John comes into the room, sits down quietly, and after
      inspecting several books, chooses his history text.
10:31 Begins to study. 
10:31½ Speaks to his neighbor.
10:32 Returns to his studying. 
10:33 Sharpens his pencil. 
10:33½ Returns to seat and stares out of the window.


Obviously, the regular teacher in the classroom cannot carry on 
much of this type of observation, although the value of this report 
for diagnosing special problems is obvious. 

Evaluation of an objective through observation requires that a
sufficiently large sample of behavior be included. Thus, if the
teacher were evaluating work habits she should be sure to observe
each of the pupils on many different occasions during the semester
to be certain that what she has observed was typical behavior rather
than atypical behavior.

By spending a few minutes each day in observing the typical
behavior of a few pupils, and by observing different ones each day,
the teacher can soon get to know her pupils much better than if
she does not systematically observe them as individuals. When
classes contain between thirty and forty children, many of the
children will go unnoticed unless the teacher focuses her attention
on them. Of course, the problem child always receives his share of
attention and notice, but the majority of the class may escape notice
as individuals. Then, when the time arrives for a formal evaluation
of a pupil's behavior or a discussion with his parent, the teacher
is suddenly faced with the fact that she really knows very little about
the child’s typical behavior in the classroom.


Informal observation: The anecdotal record 


An anecdotal record consists of a simple, factual account of an 
observed incident. As a general rule anecdotes should be written
in two situations: (1) to record unusual incidents, and (2) to
describe unusual pupils. Unusual incidents which occur should
always be recorded, even though the pupils involved may be
well-adjusted individuals. These anecdotes become part of the pupil's
record and are kept in a folder along with other information per- 
tinent to evaluating the child's behavior. The availability of data 
of this nature places the teacher in a position to understand her 
pupils better and to report their school behavior more intelligently. 
For those pupils who are emotionally disturbed, who fail to get 
along with other children, or who have other behavior problems, 
the teacher should use anecdotal records extensively. In these cases, 
the teacher might follow these students through a day or week, 
writing reports at regular times. Another system involves writing 
anecdotal reports on these pupils for every activity during a short 
period. These reports are very helpful when, with the principal, 
supervisor, counselor, or school psychologist, a plan of action is 
developed to help the child with his adjustment problems. 
Anecdotes should be restricted to an account of the actual behav- 
iors observed. Judgments of the rightness or wrongness of the 
behavior should not be included. If the teacher has an opinion as 
to the cause of the behavior, such opinion may be included but 
should be clearly labeled as teacher opinion rather than actual 
observed behavior. 
The following anecdote is acceptable because it confines itself to
reporting observed behaviors:

Name: John Doe          Date: May 7, 1957
At nutrition John grabbed a cookie from Mary S. When she 
tried to get it back, he pushed her. I separated them and told 
John to return the cookie. He did so. When I asked him why he 
had taken the cookie, he would not answer. 


A poor anecdote based on this same incident might read like this:

Name: John Doe          Date: May 7, 1957
John caused a commotion at nutrition today. He is a chronic
trouble-maker. It is his parents’ fault for not teaching him better
manners.

Anecdotes should not be collected as an end in themselves. Un- 
less the anecdotes are actually used in some way, the time taken to 
write them is wasted.


Obtaining information from the informal reports of pupils


In trying to understand their pupils, teachers often need informa- 
tion about a child's interests, how he spends his out-of-school time, 
and other data which cannot be acquired by direct observation. 
This type of information may be obtained in an informal manner 
from the students by talking with them, or by having them give oral 
or written reports containing the type of material that the teacher 
is seeking. Examples include such assignments as
“My Autobiography,” “My Favorite Hobby,” and “How I Would
Spend One Hundred Dollars.” These topics can be assigned for
the purpose of providing practice in written or oral expression, and
the content of the reports can be of great assistance to the teacher.


Obtaining information about pupils from their peers 


In addition to obtaining information about the pupil directly 
from the pupil, it is also possible to obtain information about 
pupils from their peers—the other children in the class. Sometimes 
information can be obtained about pupils by discreet questioning of 
their classmates. Teachers should use this procedure with caution.

Information relative to the social structure of the entire class 
can be gained by using the methods of sociometry. Although skill- 
ful use of sociometric techniques requires additional training and 
experience, the basic idea is simple. Pupils are asked to list the two 
or three pupils with whom they would most like to serve on a com- 
mittee, play a game, or participate in other similar activities. With 
this information it is possible to compose a graphic picture of the 
social relationships in a class, and by tabulating the number of 
times each pupil was chosen by the others in the class, the teacher 
can readily determine the pupils most popular with their classmates 
as well as those not chosen by any of their classmates. Utilization of 
these techniques enables the teacher in a short time to gather con- 
siderable information about the interrelationships of the various 
members of her class. 
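
The tabulation described above can be sketched in a few lines of Python. This is only an illustration: the pupil names, the `choices` data, and the function names `tabulate_choices`, `isolates`, and `mutual_choices` are hypothetical and are not taken from the text or from the sample sociogram.

```python
from collections import Counter

# Hypothetical data: each pupil lists a first and a second choice of
# classmates for committee work (first choice listed first).
choices = {
    "Ann": ["Beth", "Carl"],
    "Beth": ["Ann", "Dora"],
    "Carl": ["Ann", "Beth"],
    "Dora": ["Beth", "Ann"],
    "Ed": ["Carl", "Ann"],
}


def tabulate_choices(choices):
    """Count how many times each pupil was chosen by classmates."""
    counts = Counter({pupil: 0 for pupil in choices})
    for chosen in choices.values():
        for name in chosen:
            counts[name] += 1
    return counts


def isolates(choices):
    """Pupils who were not chosen by any classmate."""
    counts = tabulate_choices(choices)
    return sorted(pupil for pupil in choices if counts[pupil] == 0)


def mutual_choices(choices):
    """Pairs of pupils who chose each other."""
    pairs = set()
    for chooser, chosen in choices.items():
        for name in chosen:
            if chooser in choices.get(name, []):
                pairs.add(tuple(sorted((chooser, name))))
    return sorted(pairs)


counts = tabulate_choices(choices)
print(counts.most_common(1))    # the most frequently chosen pupil
print(isolates(choices))        # pupils chosen by no one
print(mutual_choices(choices))  # two-way choices
```

A tally of this kind only summarizes the choices; drawing the sociogram and interpreting it still call for the teacher's judgment.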

The graphic picture of the social relationships in a class is called 
a sociogram. The sample sociogram on page 55 shows the patterns 
of choice for a group of ten pupils. Each pupil was asked to name 
two other pupils that he or she would prefer to serve with on a
social studies committee and to identify them by first choice and
second choice.

[Figure: A Sample Sociogram. Legend: circle = girl; square = boy; double-headed arrow = mutual choice; single arrow = one-way choice; 1 or 2 = first or second choice.]


Sociograms are measures of social acceptance and as such can be 
used by the teacher in the following ways: (1) to identify isolates 
who can then be helped in building improved relationships with 
other children, (2) to identify groups and cliques, which because 
of race, religion, or status need to be absorbed into the class in 
order for effective group action to take place, (3) to establish a 
basis for separating a class into groups of pupils who can be 
expected to work well together. 

The "guess-who" technique is another means of collecting data 
about pupils from their peers. In the “guess-who” procedure, 
pupils are given statements describing types of behavior and are 
asked to designate other pupils who best fit these descriptions. 
Statements related to such traits as the following might be included 
in a "guess-who" questionnaire: neat and clean; takes care of public 
property; often gets angry; obeys orders; nobody likes very much;
or good at baseball. A sample “guess-who” questionnaire follows.


You and Your Classmates: A Guess-who Questionnaire¹

Name ____________________          Date ____________

 1. Which children sit very still and quiet? ____________
 2. Which children wiggle a lot and can't sit still? ____________
 3. Who are the ones everyone likes? ____________
 4. Who are the ones nobody likes very much? ____________
 5. Which children are always smiling and laughing? ____________
 6. Which children don't smile very much and seem sort of sad? ____________
 7. Which children are bossy? ____________
 8. Which children let the other children boss them? ____________
 9. I would like best to work with ____________
10. Which children are most bashful? ____________
11. Which children aren't the least bit bashful? ____________
12. Which children are the best at outdoor games? ____________
13. Which children aren't very good at games? ____________
14. Which children get mad the easiest? ____________
15. Which ones don't get angry much? ____________
16. I would like most to be like ____________
17. I would like to have ____________ for class president.


In order to keep choices from being forced, pupils should be 
informed that it is not necessary to name a pupil for each item if no 
one in the class fits the description. Accordingly, items should be 
included in the questionnaire which do not refer to specific indi- 
viduals. 

The information collected with a device of this kind is useful 
in countless school situations. “Guess-who” questionnaires have
been developed to collect data for guidance, for identifying values
important to children, and to indicate the degree of social acceptance
of groups and individuals.
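
As a rough sketch of how completed "guess-who" sheets might be summarized, the Python below counts how often each pupil is named under each item. The sheets, pupil names, and the function name `tally_nominations` are hypothetical; an item left blank simply contributes no names, in keeping with the rule that choices should not be forced.

```python
from collections import Counter, defaultdict

# Hypothetical completed sheets: item text -> names the respondent wrote in.
# An empty list means the respondent named no one for that item.
sheets = [
    {"Who are the ones everyone likes?": ["Ann", "Carl"],
     "Which children are bossy?": []},
    {"Who are the ones everyone likes?": ["Ann"],
     "Which children are bossy?": ["Ed"]},
    {"Who are the ones everyone likes?": ["Carl", "Ann"],
     "Which children are bossy?": ["Ed"]},
]


def tally_nominations(sheets):
    """Return, for each item, a count of how often each pupil was named."""
    tallies = defaultdict(Counter)
    for sheet in sheets:
        for item, names in sheet.items():
            tallies[item].update(names)
    return tallies


tallies = tally_nominations(sheets)
for item, counts in tallies.items():
    print(item, dict(counts))
```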


¹ Reproduced with permission of Los Angeles County Schools Office.


EXERCISES


1. Differentiate between typical behavior and test behavior using
college students’ grammar as an example. 

2. Develop a check list to be used in observing courtesy in a sixth- 
grade class. 

3. Develop a rating scale to be used in observing health habits for 
a grade level of your choice. 

4. Develop a rating scale to be used in observing habits of critical
thinking. Identify the class and grade level where the scale is to 
be used. 

5. Arrange to have two members of your class act out some incident 
that might occur in a classroom. Have the other members of the 
class write an anecdotal record about the incident. Compare 
and discuss the anecdotal records prepared by various members 
of the class. 

6. Develop a “guess-who” questionnaire after identifying objectives 
and grade level. 


SUGGESTED ADDITIONAL READINGS 


Magnuson, H.W., et al. Evaluating Pupil Progress. 

Staff, Division on Child Development, American Council on Education.
Helping Teachers Understand Children.

Thomas, R.M. Judging Student Progress.

Torgerson, T.L., and Adams, G.S. Measurement and Evaluation.


CHAPTER 6 


Summarizing and Reporting Pupil 
Achievement and Typical Behavior 


Among the more complicated tasks the teacher undertakes is that 
of “marking” or “grading.” Most teachers would be relieved to be 
free of the necessity to mark or grade their pupils, but some kind of 
report is necessary for several reasons. 

First, parents want and have a right to know how their children
are progressing in school. Second, some sort of permanent record
is needed by school personnel to aid them in proper placement,
promotion, and guidance of pupils and to provide the data necessary
to evaluate a pupil’s achievement when prospective employers
or colleges request information. Third, the classroom performance 
of the typical pupil is influenced, at least to some degree, by marks 
or grades reported. Usually summary marks or grades are not 
needed by the pupil for the purpose of knowing how he is doing in 
any given subject. His day-to-day successes and failures reveal to 
him with a fair degree of precision his progress or lack of it. Were 
periodic reports given only for the pupil’s use, they might be 
eliminated. 


The following are the most widely used methods of reporting to
parents.


Parent-teacher conferences


Face-to-face meetings between teachers and parents have been 
most widely used at the elementary school level. Such meetings pro- 
vide an excellent opportunity for the teacher to convey informa- 
tion regarding the pupil's achievement and typical behavior to the 
parent. The parent finds out about the nature and purpose of 
classroom activities. The teacher obtains more information about 
the pupil’s out-of-school behavior and environment, which is often 
helpful in understanding the pupil. In spite of the advantages of 
the method, there are certain practical difficulties which must be 
overcome. These meetings are time-consuming, especially at the 
secondary level, where a teacher may teach as many as 150 to 200 
different pupils during one semester. Also, getting parents to come 
to school for the conference sometimes can be difficult. Not all 
teachers are skilled at face-to-face reporting and therefore may be 
ineffectual in parent-teacher conferences. Unless thought and effort 
are used, parent-teacher conferences can become stereotyped, with 
the teacher using a few pat phrases to describe the achievements 
and behavior of her pupils. If the difficulties inherent in the 
method can be resolved, the parent-teacher conference is a most 
desirable means of reporting to parents. 


Letters to parents 


The next most flexible way of reporting to parents is the unstruc- 
tured letter. This method of communication allows comments on 
any aspects of the pupil’s achievements and behaviors which should 
be reported to the parents. Writing good letters which are truly
individual and which convey the exact meaning intended is a diffi- 
cult task. There is a wide variation in teachers’ ability to write 
letters of this type. Letters are time-consuming, particularly if 
clerical help is not available. But assuming that the time is avail- 
able and the ability to write good letters is present, this means of 
report can be very effective. As with the conference method, the 
“letter to parents” is most likely to be used in the elementary school. 


Check lists 


More structured than the informal letter is the check-list type of
report listing a number of descriptive phrases which can be checked
to indicate that the phrase applies to the pupil. Such check lists
tend to be rather long, with many specific judgments called for from
the teacher. Such phrases as the following might be included in
that portion of a check list which describes the pupil’s achievement
in reading:

1. Reads with understanding. 

2. Reads well aloud. 

3. Handles new words efficiently. 

4. Reads independently. 


Since there are many different objectives to be evaluated at most 
grade levels, and since each objective involves the use of one or 
more phrases on the check list, most lists tend to be very long. 
Although the check-list type of report can provide highly detailed
information to the parent, the large number of different items
sometimes leaves the parents confused. Check lists are more widely
used in the elementary than in the secondary schools.


Report cards 


The traditional means of reporting school achievement and 
behavior is the report card. In its simplest form the report card 
consists of a list of subjects (e.g., reading, arithmetic) in each of 
which the pupil is marked with either a letter or a number which 
signifies his achievement in the subject. The actual symbols vary 
widely from school to school. The most traditional are the A, B, 
C, D, F system and the percentage system wherein 100 signifies the 
highest possible mark. Some systems use only the symbols S and U 
where S means satisfactory and U means unsatisfactory. Many other 
sets of symbols are used to report achievement. There is no reason 
why a school should not adopt any set of symbols for use in report- 
ing as long as the meanings of the symbols are clearly understood 
by all who use them. Few report cards today consist exclusively of 
marks in subject matter. Almost all cards include some provision 
for reporting such typical behaviors as “citizenship,” “work habits,” 
“effort,” or the like. 


Although the traditional report card is a much simpler method
of reporting than any of the other methods mentioned thus far, it
is also the one which provides the least information for the parent.
This difficulty arises from the fact that the marks usually are poorly
defined and no one knows exactly what a given mark means except
the person who gave it.

Many modern report cards combine elements of the traditional 
report card and the check list to provide a more flexible means of 
communication. 

The progress report reproduced on pages 62-65 is a good example 
of a combination report card and check list. Notice especially the 
variety of educational objectives that can be evaluated, and the 
fact that growth in knowledge and skills can be reported in terms 
of position within the class group and in terms of the pupil's own 
ability to achieve. 


Cumulative records 


In addition to reporting pupil achievement and typical behavior 
to parents, almost all schools maintain some type of cumulative 
record in which the marks assigned to pupils throughout their 
school years are recorded. Such cumulative records usually contain 
in addition information regarding pupils' attendance, records of 
standardized test results, anecdotal records, health data, records of 
participation in school and extracurricular activities, and data 
about the home and family and about pupils' interests and objec- 
tives. By referring to these cumulative records the teacher can 
quickly obtain much important information which will help her 
in understanding her pupils. These records contain the necessary 
material to answer questions from employers or other schools and 
colleges. 


Basis for assigning grades 


The “A” Johnny brings home in reading can have a variety of
meanings. It can mean that Johnny is reading consistently above 
his grade level. It can mean that Johnny is one of the better 
readers in his class even though not reading above grade level. It 
can mean that Johnny is reading as well as the teacher thinks he 
can, even though he may actually be one of the poorer readers 
in the room in terms of absolute achievement. The grade in reading 
might even be based on the fact that Johnny is a "good" boy and 
never causes the teacher any trouble in class. Teachers often mistakenly
give high grades to children who are “good” in class
behavior even though they may not be equally good in their
achievement. It is obvious then that the “A” on Johnny’s report
card is meaningless until the method used in determining the grade
is known.


FOURTH, FIFTH AND SIXTH GRADE


MESSAGE TO PARENTS

The staff of the Rivera Elementary School District believes that
the education of the pupil is a cooperative enterprise in which the
home and school should work closely together. The school strives to
help the child develop those skills and attitudes necessary for a
desirable citizen in our democracy.

This report with individual parent-teacher conferences should give
the home a picture of the child's progress.

Very truly yours,

Eli R. Steed
District Superintendent

Principal ____________

Teacher ____________

Reproduced by permission of Eli R. Steed, Superintendent,
Rivera School District, Rivera, California.


ATTITUDES AND BEHAVIOR

(Each item is marked in one of three columns: Almost always,
Part of time, Very seldom.)

Work and Study Habits
    Listens attentively
    Follows directions
    Uses time wisely
    Does neat and careful work
    Begins and finishes work on time
    Speaks only in turn

Social Development
    Gets along well with others
    Accepts responsibility
    Respects rights and property of others
    Recognizes and solves own problems
    Is courteous
    Is developing self control

Health and Safety
    Practices cleanliness
    Maintains good sitting and standing posture
    Obeys rules and regulations

Days Absent: Mar. ____  June ____
Height and Weight: Nov. ____  June ____

Conference ____________


GROWTH IN KNOWLEDGE AND SKILLS

Explanation of marks in the square
    Achievement level (based on child's position within his class
    group, as judged by standardized tests, teacher-made tests, and
    teacher observation).
    A: Superior
    B: Above average
    C: Average
    D: Below average
    F: Failing

Explanation of marks in the circle
    Marks in this column show the relationship between a child's
    achievement and his ability to achieve. (Effort)
    1: Commendable, is doing unusually good work in terms of
       his own ability
    2: Satisfactory, making progress consistent with his ability
    3: Needs to improve, progress not consistent with his ability

    *: Comment enclosed

Desirable growth is listed below each subject. You will find an "N"
where your child needs to improve.

(Each subject is marked for Nov., Mar., and June.)

Reading
    Reads with understanding
    Applies phonetic understandings
    Uses dictionary skills
    Shows interest in independent reading
    Reads well orally
    Finishes reading assignments

Arithmetic
    Understands processes taught
    Works accurately
    Solves word problems

Language
    Expresses ideas effectively in speaking
    Expresses ideas effectively in writing
    Uses language fundamentals correctly

Spelling
    Spells carefully in all written work
    Knows words in spelling list

Social Studies
    History & Geography of ____________
    Understands and uses facts of social studies
    Contributes to group activities and discussions
    Interprets globes, maps, and charts

Science
    Shows growth in scientific facts and understandings
    Uses scientific method in solving problems

Handwriting
    Uses neat and legible handwriting

Music
    Responds to and enjoys music
    Shows growth in musical activities

Art
    Shows growth in ability to express ideas creatively
    Shows interest in and enthusiasm for art
    Uses art materials with care and understanding

Physical Education
    Shows progress in good sportsmanship
    Grows in physical skills

RECOMMENDED PLACEMENT ____________

There are three main ways of assessing achievement: (1) in
relation to absolute standards, (2) in relation to the performance 
of the other pupils in the same class or grade, and (3) in relation 
to the pupil’s own ability. The differences between these can be 
illustrated by considering the case of the pupil in the fifth grade 
who has an I.Q. of 85 and whose reading grade placement is 4.5,
which means he reads as well as the average child in the fifth month 
of the fourth year. If this child were graded on the basis of absolute 
standards, he would probably receive a C or D in reading. If he 
were a member of a slow class, he might be one of the better 
members of the class and, if graded on his performance relative 
to that of his classmates, might receive a B. If he were graded on 
the basis of his achievement in relationship to his own ability he 
might even receive an A. Thus, depending on the method of 
grading used, this pupil might receive any grade from A to D for 
the same work. When each teacher in a school independently 
decides what a grade should mean and on what basis it should be 
assigned, the grades become meaningless. It is essential that 
throughout a school system there be as much agreement as possible 
on the basis for grading and on the meaning of the symbols used in 
reporting grades. 

In the preceding example there were two facts which should 
have been conveyed to the parent. (1) The pupil was reading as 
well as could be expected in terms of his ability, and (2) he was 
reading below the average child in his grade. No single grade 
can convey both of these pieces of information. It is, of course, 
possible to report the two separately. Any system of assigning 
grades which will convey both of these facts is superior to a single- 
symbol system. The addition of a series of check-list phrases which 
would enable the teacher to indicate the strengths and weaknesses 
of the pupil in various aspects of reading would strengthen the 
reporting system. 

The study of the Progress Report reproduced on pages 62-65
will show that provisions have been made for separately reporting
achievement with respect to class standards and with respect to the
pupil’s own ability. Report cards used in secondary school usually
report only a single mark in a subject, one based on the pupil's
achievement relative to the other pupils at the same grade level.


Collecting the evidence on which to base the report 


Whatever form of reporting is used, the teacher must collect evi- 
dence on which to base the report. No system of reporting which 
depends entirely on the subjective impressions and memory of the 
teacher can be fair to the pupils. The teacher needs evidence in 
the form of records of the pupil's achievement and typical behavior. 
Such records are usually kept in the teacher's classbook or rollbook 
and/or in some kind of notebook with a separate page or section 
devoted to each pupil. Since the teacher must usually collect many 
kinds of evidence, and since some of the evidence must be recorded 
in the forms of anecdotes or comments, there is usually not sufficient 
room in the classbook alone to meet this need. Some teachers prefer 
to keep a folder for each pupil and record all information pertain- 
ing to the pupil in the folder. The use of a folder enables the 
teacher to preserve samples of the pupil's work. These samples of 
work can be used to good advantage during a parent-teacher con- 
ference. 

Consideration of the objectives to be evaluated will determine the
kinds of evidence to be collected (see Chapter 1 for an illustration
for collecting evidence on "work habits"). Only by deciding what
kinds of evidence to collect, and by setting up a system for collecting
and recording this evidence, can the teacher be in a position to
adequately and fairly report on her pupils' achievements and typical
behaviors.


Improving marking and grading practices 


From the preceding discussion it should be clear that the process 
of reporting is a complex one with many pitfalls for the unwary. 
Since most teachers are required to report their pupils' progress,
the problem becomes one of improving marking and reporting pro- 
cedures. Usually such improvement can best be achieved by the 
cooperative efforts of teachers, administrators, and parents. 

The following criteria should be met in any system of marking 
and reporting. 

1. The system should communicate all of the important information
   about the pupil's achievement and behavior to the parent.

2. The meanings of the symbols used and the basis for assigning
   such symbols should be clearly understood by both parent and
   teacher.

3. All teachers of the same educational level in a particular
   school or school district should have the same philosophy of
   marking. This is essential if marks are to have any meaning.

4. The system of marking and reporting should not be so complicated
   as to place an undue burden on the teacher.


Quantitative aspects of marking 


Assigning marks is essentially a philosophic rather than a statis- 
tical process. However, most teachers use numbers in some way in 
assigning marks and in arriving at summary grades for reporting 
purposes. 

Consider that a spelling test of twenty words has been given to 
a seventh-grade class and that one of the pupils in the class spelled 
fifteen of the words correctly. What score should the student receive 
on the test? One method of scoring would be to use the number 
right (fifteen) and enter this number in the record book. This 
kind of score is called a raw score. Another method of scoring 
would be to use the percentage system and assign a score of 75 
percent since each one of the twenty words represented 5 percent 
of the total score. Still another method would be to assign a letter 
grade (e.g., A, B, C, D, or F) to the test and record that grade in
the record book. 
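
The arithmetic behind the percentage score can be written out as a short Python sketch. The figures (fifteen right out of twenty) come from the example above; the function name `percentage_score` is only an illustrative assumption.

```python
def percentage_score(number_right, number_of_items):
    """Convert a raw score (number right) to a percentage of the items."""
    if number_of_items <= 0:
        raise ValueError("a test must contain at least one item")
    return 100 * number_right / number_of_items


raw_score = 15   # words spelled correctly (the raw score in the example)
items = 20       # words on the test

print(percentage_score(raw_score, items))  # 75.0
print(100 / items)                         # 5.0 percent for each word
```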

Which method should a teacher use? To answer this question the
meaning of a test score must be considered. Actually what is known
about this pupil's spelling ability from the fact that he got fifteen
right out of the twenty words on the test? The answer is “almost
nothing.” Suppose that the test were composed of twenty easy
words such as “cat” and “dog.” Missing even five of such easy words
would constitute very poor spelling ability for the seventh grade
and might even represent the performance of the poorest speller
in the class. On the other hand, if the test were composed of
twenty very difficult words, fifteen right might represent the
performance of the best speller in the class. Thus it is apparent that
raw scores or percentage scores by themselves are meaningless. An
exception to this rule is found in tasks that are so standardized that
the raw scores do represent meaningful quantities by themselves.
Such scores as “60 words per minute in typing” and “10 seconds to
run 100 yards” have common meaning throughout the country.
Unlike these last two examples, scores such as “15 right out of 20”
in spelling or 85 percent in arithmetic have no common meaning
because of differences in the difficulty of material contained in
various tests.

Consider two pupils in the seventh grade. Johnny and Mary are 
in the same school but have different teachers. Although they study 
the same spelling words in both classes, Johnny’s teacher gives a 
test containing most of the easy words and few of the hard ones, 
while Mary’s teacher gives a test containing almost all hard words. 
Johnny receives a grade of 85 percent and Mary (who is actually 
a better speller than Johnny) receives a grade of 75 percent. Does 
this mean that Johnny actually spells better than Mary? Although 
this is probably the way in which Johnny's mother would interpret 
these two scores, it is obvious that these scores cannot be fairly 
compared. 

A better procedure than using raw scores or percentage scores is 
that of converting the raw scores into some kind of score which 
reflects the pupil’s position in the class. Suppose that a teacher has 
given a twenty-item spelling test to her class and has counted and 
recorded the number right on each paper. She does not want to re- 
cord either a raw score or a percentage score, and decides to assign 
letter grades to the test papers. Some teachers convert percentage 
scores into letter scores by such definitions as “90 percent to 100 
percent equals A.” This type of conversion accomplishes nothing 
since the basic objection to raw scores and percentage scores is not 
eliminated. A better way to assign letter grades is to make a fre- 
quency distribution of the test scores and then assign grades by 
cutting the distribution at those points which will result in a fair 
distribution of grades for the particular test. Table 4 on page 70 
is an example of a frequency distribution with letter grades assigned 
to the raw scores. 

Many teachers employ plus and minus signs to provide a greater 
range of grades. Thus, instead of just marking a paper B, it 
becomes possible to mark it B, B+, or B-. 

Another method of assigning marks is to use a nine-point numeri- 
cal system. In this system nine represents the highest mark, five the 
average mark, and one the lowest mark. These nine numbers corre- 
spond to letter grades as shown in Table 5 on page 70. 
Table 4  Assigning Grades from a Frequency Distribution 

Score on    Number Receiving    Letter 
  Test           Score          Grade 
--------------------------------------- 
   20 
   19              1              A 
   18              1 
--------------------------------------- 
   17              2 
   16              3              B 
   15              2 
--------------------------------------- 
   13              4 
   12              5              C 
   11              2 
   10              1 
--------------------------------------- 
    9              2 
    8              1 
    7              1              D 
    6 
    5 
--------------------------------------- 
    3              1 
    2                             F 
    1 
    0 

The nine-point system has the advantage that it does not use 
plus or minus signs and that the marks recorded are numbers which 
can be averaged at the end of the semester. 


Table 5  A Nine-point Marking System 

Number    Letter equivalent 
   9      A 
   8      Between A and B 
   7      B 
   6      Between B and C 
   5      C 
   4      Between C and D 
   3      D 
   2      Between D and F 
   1      F 
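
As a rough illustration of why numerical marks are convenient to average, the sketch below averages a handful of nine-point marks and looks up the nearest letter equivalent from Table 5. The marks themselves are hypothetical, and rounding to the nearest whole mark is an assumption made for the example, not a rule stated in the text.

```python
# Letter equivalents from Table 5.
NINE_POINT = {
    9: "A", 8: "Between A and B", 7: "B", 6: "Between B and C",
    5: "C", 4: "Between C and D", 3: "D", 2: "Between D and F", 1: "F",
}

# Hypothetical marks recorded for one pupil during a semester.
marks = [7, 6, 8, 5, 7]

average = sum(marks) / len(marks)

# Assumption: report the whole-number mark nearest the average.
# Note that Python's round() sends exact halves to the nearest even
# integer; how to treat a half is a matter for teacher judgment.
nearest_mark = round(average)
equivalent = NINE_POINT[nearest_mark]

print(f"average = {average:.1f}, nearest mark = {nearest_mark} ({equivalent})")
```

No conversion table is needed before averaging, which is the advantage the nine-point system has over plus and minus letter grades.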


To use either the A, B, C, D, F system or the nine-point system, 
the procedure is as follows: 

1. Assign a raw score to each paper. This may be the number of 
   questions right on a test or any other raw score obtained by 
   marking a test, product, or procedure. 

2. Make a frequency distribution of the raw scores. This consists 
   of listing each possible score on the test and then tallying the 
   number of pupils who received each score. 

3. Determine which raw scores shall be equivalent to which letter 
   grades or numerical grades. This determination is essentially a 
   matter of teacher judgment rather than a statistical decision. 
   Usually, more C's will be given than any other grade unless the 
   group is special in some way. There will usually be more B's 
   than A's and more D's than F's. When the group is atypical 
   (either considerably better than average or considerably poorer 
   than average) this rule will have to be modified. It is reasonable 
   to expect particularly able classes to have a high percentage of 
   A's and B's and particularly poor classes to have a high 
   percentage of C's, D's and F's. 
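
The three steps above can be sketched in a few lines of Python. The raw scores are illustrative data patterned on Table 4, and the cut points (18, 15, 10, 5) are one teacher's hypothetical judgment; in particular, placing scores of 14 and 4 in the C and F ranges is an assumption of this sketch. The name `letter_grade` is invented for the example.

```python
from collections import Counter

# Step 1: raw scores on a twenty-item test (illustrative data only).
raw_scores = [19, 18, 17, 17, 16, 16, 16, 15, 15, 13, 13, 13, 13,
              12, 12, 12, 12, 12, 11, 11, 10, 9, 9, 8, 7, 3]

# Step 2: frequency distribution (score -> number of pupils).
frequency = Counter(raw_scores)

# Step 3: cut points set by teacher judgment, highest grade first.
# Each pair is (lowest raw score that earns the grade, grade).
cut_points = [(18, "A"), (15, "B"), (10, "C"), (5, "D"), (0, "F")]


def letter_grade(score, cuts=cut_points):
    """Return the grade of the first cut point the score reaches."""
    for lowest, grade in cuts:
        if score >= lowest:
            return grade
    raise ValueError(f"score {score} is below every cut point")


# Print the distribution from the highest possible score to the lowest.
for score in range(20, -1, -1):
    print(f"{score:>2}  {frequency.get(score, 0):>2}  {letter_grade(score)}")

# How many pupils receive each letter grade under these cut points.
grade_counts = Counter(letter_grade(s) for s in raw_scores)
print(dict(grade_counts))
```

Changing the cut points and rerunning shows at once how the distribution of grades shifts, which is useful when deciding what is fair for a particular class.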

There is no statistically correct percentage of A's, B's, and so 
forth, that should be given by all teachers in all classes. Teachers 
sometimes mistakenly attempt to grade "on the curve," that is, to 
assign the same percentage of each letter grade in every class they 
teach. As was pointed out in the previous paragraph, the nature of 
the group must be considered in assigning grades, so no one 
system of percentages could possibly apply to all classes. There 
should, however, be as much agreement as possible between the 
grading policies of all the teachers in a school so that the kind of 
work that is marked “C” by one teacher will not be marked “A” by 
the teacher in the next room. 


Weighting test scores 


Usually not all tests or other assignments are of equal impor- 
tance. Unless some system of weighting the marks given on the 
various assignments is used, the mark on a five-minute quiz will 
have as much weight in determining the final mark as the score on a 
major examination. The simplest method of weighting test scores 
is to record the score one time if it is of minor importance and to 
record it two or more times for more important assignments. Thus, 
a grade of B on a five-minute quiz might be recorded as a B, a grade 
of C on a mid-term examination might be recorded twice (C, C), 
and the grade of B- on the final examination might be recorded 
four times (B-, B-, B-, B-). 
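
The record-it-more-than-once method can be sketched as follows. The weights 1, 2, and 4 come from the example in the text; the function name `expand_weighted` and the data layout are hypothetical.

```python
# Each entry: (assignment, grade, weight), where the weight is the
# number of times the grade is entered in the record book.
marks = [
    ("five-minute quiz", "B", 1),
    ("mid-term examination", "C", 2),
    ("final examination", "B-", 4),
]


def expand_weighted(entries):
    """Repeat each grade as many times as its weight."""
    record = []
    for _assignment, grade, weight in entries:
        record.extend([grade] * weight)
    return record


record_book = expand_weighted(marks)
print(record_book)
```

Any later averaging then counts the final examination four times as heavily as the quiz without further arithmetic.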


Summarizing grades 


If the teacher has recorded grades for various tests, assignments, 
and other class projects, then at the end of the marking period she 
will have the task of distilling from all the separate marks one 
summary mark (in each subject) to use for reporting purposes. 
Suppose that at the end of the semester the teacher has the following 
grades for a student in spelling: 

B, C, D, B, F, A, B, C, C 

What final grade should this student receive in spelling? To answer 
this question requires that the grades be averaged in some way. One 
method of averaging grades is to assign some numerical equivalent 
to each grade, add up the total points earned, and then divide this 
total by the number of scores which were included in the total. The 
result of this procedure is technically known as the arithmetic mean. 
Most people call it the “average,” but since there is more than one 
kind of statistical “average,” it is necessary to be precise and refer to 
this particular kind of an average as a “mean.” To obtain the mean 
of the above grades in spelling, it is first necessary to convert the 
letter grades to numbers. In this case let A equal 5, B equal 4, C 
equal 3, D equal 2, and F equal 1. The pupil's scores in spelling 
can now be represented as: 

4, 3, 2, 4, 1, 5, 4, 3, 3 

The total points earned by the student is 29, and since there are 
nine scores included in the total, the mean score is 3.2 (29 divided 
by 9). Since 3.2 is closer to 3 than to 4 the student receives a C in 
spelling on the basis of his mean score. 
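
The same arithmetic can be checked with a short sketch. The point values (A = 5 through F = 1) and the nine grades B, C, D, B, F, A, B, C, C are those of the example; the function name `mean_grade` is invented, and picking the letter whose point value lies closest to the mean is this sketch's reading of "closer to 3 than to 4."

```python
POINTS = {"A": 5, "B": 4, "C": 3, "D": 2, "F": 1}

grades = ["B", "C", "D", "B", "F", "A", "B", "C", "C"]


def mean_grade(letter_grades, points=POINTS):
    """Return (total points, mean, letter nearest the mean)."""
    values = [points[g] for g in letter_grades]
    total = sum(values)
    mean = total / len(values)
    # Assumption: an exact tie between two letters goes to the higher
    # one, because min() keeps the first of equally close candidates.
    letter = min(points, key=lambda g: abs(points[g] - mean))
    return total, mean, letter


total, mean, letter = mean_grade(grades)
print(f"total = {total}, mean = {mean:.1f}, grade = {letter}")
```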


Another kind of statistical average is the median. The median is 
simply the middle score in the distribution when the scores are 
arranged in order from the highest to the lowest. If the pupil’s 
spelling scores are arranged in this fashion they become: 

A, B, B, B, C, C, C, D, F 

Since there are nine separate scores, the fifth score (from either 
end) is the median. In the example above, the median grade is C. 
In case there are an even number of scores, the median is considered 
to be midway between the two middle scores of the distribution. 

The median has two advantages over the mean. First, it is much 
easier to compute, particularly when each pupil has a large number 
of separate grades. Second, it is more representative of the pupil’s 
typical behavior, since it is not affected by extremely high or low 
scores as the mean is. 

The majority of teachers employ the mean in summarizing grades, 
but there is no reason why the median cannot be employed for this 
purpose. Its use is especially recommended in those instances where 
a large number of grades are to be averaged to arrive at a summary 
grade. 
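
A brief sketch of the median for the same nine spelling grades follows, using the `median` function from Python's standard `statistics` module. The point values are the ones used for the mean (A = 5 through F = 1); the extra list of six scores is hypothetical and is included only to show the even-count case.

```python
from statistics import median

POINTS = {"A": 5, "B": 4, "C": 3, "D": 2, "F": 1}
LETTERS = {value: letter for letter, value in POINTS.items()}

grades = ["B", "C", "D", "B", "F", "A", "B", "C", "C"]

# Arrange the scores from highest to lowest, as the text describes.
ordered = sorted((POINTS[g] for g in grades), reverse=True)
print([LETTERS[v] for v in ordered])

# With nine scores the median is the fifth score from either end.
middle = median(ordered)
print(f"median = {middle} ({LETTERS[middle]})")

# With an even number of scores the median lies midway between the
# two middle scores (hypothetical data).
even_case = median([5, 4, 4, 3, 2, 1])
print(f"median of six scores = {even_case}")
```

Replacing the single F with another A changes the mean but leaves the median at C, which illustrates the second advantage claimed for the median.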


EXERCISES 


1. Organize a panel and discuss the advantages and disadvantages 
   of the A-B-C-D-F system of grading. 

2. Based on your own philosophy, write a description of an ideal 
   grading system. 

3. Assign a committee to investigate different forms used locally 
   for report cards and report on the merits of each. 

4. What are the arguments for and against grading on the basis of 
   the child's ability? Do these same arguments apply to a system of 
   grading on individual pupil growth? 

5. Find the mean of the following scores: 9, 7, 10, 8, 10, 8, 9, 9, 
   6, 8, 4, 10, 8. 

6. Find the median of the scores in Question 5. 

7. A social studies pupil was given the following marks during one 
   week: 5-minute quiz, C-; midterm exam, B; 5-minute oral 
   report, C. Should these marks receive equal weight, and if not, 
   how would you weight them? 

8. Assume that as a teacher you are scoring final exams; discuss how 
   you would determine the dividing line between the C's and B's 
   and between the C's and D's. 




CHAPTER 7 


Evaluating Achievement 
with Standardized Tests 


A large number of achievement tests are available from com- 
mercial publishers. These tests are usually referred to as standard- 
ized achievement tests. The word “standardized” is used to desig- 
nate the fact that the tests are administered, scored, and inter- 
preted in a standard way and that the tests are accompanied by 
"norms." Norms are records of the performances made by groups 
of individuals who have previously taken the test. They are used 
as a means of determining how the score of any individual who 
takes the test compares with scores made by other persons. 

Standardized achievement tests differ from teacher-made achieve- 
ment tests in several ways. Since the standardized achievement test 
is designed for use in many different school systems throughout the 
country, the content must be based on broad objectives common to 
many different courses of study in the area being tested. Teacher- 
made achievement tests, on the other hand, are designed to measure 
the attainment of objectives in a specific class at a particular time. 
Standardized achievement tests are usually more expertly designed 
and constructed than the typical teacher-constructed exam. The 
financial returns accruing from the sale of tests enable the 
test-maker to spend considerable time and money in item writing, in 
item selection, and on various kinds of research necessary to 
improve the test. 

Although teacher-made achievement tests are more appropriate 
for evaluating specific learning within a given class, standardized 
achievement tests can be helpful in providing information regard- 
ing pupils’ status and progress in broad educational objectives. 


Tests of basic skills 


Among the most useful standardized achievement tests are those 
in the area of basic skills, such as reading, language, and arithmetic. 
Since many different schools have similar objectives and teach 
similar materials in the basic skills areas, standardized achievement 
tests in these areas are widely used. However, in spite of this basic 
agreement on objectives, the contents of the various standardized 
achievement tests in these areas vary widely. The only method of 
determining whether a particular standardized test in any one of 
the basic skills is suitable for a particular school or school system 
is to examine the test and compare it with the local objectives. 


Tests in content areas 


There are many achievement tests which cover the content of non- 
skill areas, such as social studies and science. However, there is great 
variation in the content covered in various classes and school sys- 
tems in these areas. This fact has made standardized achievement 
tests of content unusable in many situations. Recognizing this great 
variability in content, test-makers have devised tests which measure 
pupils’ ability to apply school learnings to the solution of new prob- 
lems. Two tests of this type are described in the next section. This 
approach is becoming increasingly popular because there is more 
agreement on the over-all broad objectives of education in these 
areas than there is on the content of any one course which con- 
tributes to the attainment of the over-all objectives. 


Tests of broad educational objectives 


One of the first and most widely used tests of broad educational 
objectives is the Iowa Tests of Educational Development (Science 
Research Associates). The tests are devised to yield evidence of the 
degree to which concepts are understood rather than the degree to 
which isolated facts are recalled. The series is designed for grades 
8.5 to 13.5 and yields ten scores: understanding of basic social 
concepts, general background in the natural sciences, correctness and 
appropriateness of expression, ability to do quantitative thinking, 
interpretation of reading materials in the social studies, 
interpretation of reading materials in the natural sciences, 
interpretation of literary materials, general vocabulary, the subtotal 
of these eight tests, and using sources of information. The total 
series of tests requires approximately eight hours of testing time. 

The latest and most ambitious series of tests designed to measure 
broad educational objectives is the Sequential Tests of Educational 
Progress (Cooperative Test Division, Educational Testing Service). 
This series consists of tests of: essay writing, listening 
comprehension, reading comprehension, writing, science, mathematics, 
and social studies. The tests emphasize measurement of the ability to 
apply school-learned skills in solving new problems.¹ Each of the 
tests is available for four different levels: grades 4-6, 7-9, 10-12, 
and 13-14. The total series of tests requires approximately seven and 
one-half hours of testing time. 


Achievement test series 


Most of the major publishers publish achievement test series 
which include achievement tests in several areas and cover several 
different educational levels. Typical achievement test series 
are listed in Table 6 on page 78. These series have norms based on 
the same or on comparable populations. This fact enables the 
test-user to compare the performance of a pupil or a class in one 
subject, such as reading, with the performance in any other subject 
covered by the test, e.g., arithmetic. 


Diagnostic tests 


Most of the tests referred to in the preceding sections have been 
"survey" tests; that is, they cover a broad area and result in a total 
score which reflects over-all achievement in the area tested. Thus 
teachers can say that a pupil is doing well in arithmetic or doing 
poorly in arithmetic; but they do not know why, nor do they know 
what arithmetic concepts are causing difficulty. There are tests 
which have been devised to provide information about the specific 
nature of a pupil's difficulties in given subject areas. These tests 
are called diagnostic tests. A diagnostic test in division of whole 
numbers may reveal that the pupil is having trouble with problems 
involving zero but that otherwise he can perform the division 
process adequately. There are a number of good diagnostic tests 
on the market, particularly for reading and arithmetic. Any test 
can be used as a diagnostic test in a limited way by examining the 
students' performance on the individual items which make up the 
test rather than on the test as a whole. However, the typical survey 
test does not include sufficient items in any one area to enable the 
test to be used successfully as a diagnostic instrument. 

¹ See Chapter 2 for typical items from this series. 


Test norms 


Standardized tests are given under "standard conditions," which 
means that the same set of directions is given to all students who 
take the test and that the same time limits are imposed. Therefore, 
the performance of individuals taking the test at widely different 
times and places can be compared. To provide a method whereby 
an individual's performance on a standardized test can be compared 
with that of other individuals, the standardized test usually fur- 
nishes "norms." The word norm means average. The "norms" on 
a test are the averages of various groups who have taken the test. 

By using different kinds of norms, such statements can be made 
as, “Johnny reads at the 5.8 grade level” (reads as well as the 
average child in the eighth month of the fifth grade), and “Jane reads 
at the 70th percentile for ninth grade students” (reads better than 
70 percent of the children in the ninth grade). The statement 
about Johnny’s reading ability was based on “grade norms,” while 
the statement about Jane's reading ability was based on “percentile 
norms.” Although grade norms are widely used, particularly at the 
elementary school level, percentile norms are generally more useful. 
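
A percentile statement of this kind can be illustrated with a small sketch. The norm-group scores below are hypothetical, the function name `percentile_rank` is invented, and the definition used (the percent of the norm group scoring strictly below the pupil) follows the wording of the example above. Published tests may compute percentile ranks by somewhat different conventions, so the test manual governs in practice.

```python
from bisect import bisect_left


def percentile_rank(score, norm_scores):
    """Percent of the norm group scoring strictly below `score`."""
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, score)  # count of scores < score
    return 100 * below / len(ordered)


# Hypothetical raw scores earned by a ninth-grade norm group.
norm_group = [12, 15, 18, 20, 22, 25, 27, 30, 33, 36]

jane = percentile_rank(30, norm_group)
print(f"Jane's raw score of 30 falls at percentile rank {jane:.0f}")
```

The same function applied to scores from one school's own pupils would give a locally based percentile rank rather than one based on the publisher's norm group.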

The norms furnished with a test are valuable only if a comparison 
is to be made with pupils from other schools. Since norms are 
based on an average of various regions of the country, large and 
small schools, pupils of various backgrounds, and schools with 
varying educational objectives, it is very difficult to determine whether 
the particular norms furnished with a test are ones which can 
reasonably be employed. The words “national norm” are often 
heard in connection with standardized tests. The words are 
misleading to the extent that there is no one “national norm” 
employed by all makers of standardized tests. The norms furnished 
with the various achievement test series all differ from each other. 
Sometimes the differences are quite large. The only way of 
determining whether the norms supplied with a particular test are 
appropriate for use in a given situation is to study the information 
contained in the manual. 


Table 6  Representative Achievement Test Series 

California Achievement Tests, 1957 Edition 
(California Test Bureau) 
   Grade levels:   Lower Primary (1-2); Upper Primary (3-4); 
                   Elementary (4-6); Junior High (7-9); 
                   Advanced (9-14) 
   Tests included: Reading, arithmetic, language. 

Coordinated Scales of Attainment 
(Educational Test Bureau) 
   Grade levels:   1-8 (separate form for each grade level) 
   Tests included: Reading, arithmetic, and spelling in the forms 
                   for grades 1, 2, and 3. Language, history, 
                   geography, science, and literature are added for 
                   grades 4 and above. 

Essential High School Content Battery 
(World Book Company) 
   Grade levels:   9-13 
   Tests included: Mathematics, science, social studies, English. 

Iowa Every-Pupil Tests of Basic Skills 
(Houghton Mifflin Company) 
   Grade levels:   Elementary (3-5); Advanced (5-9) 
   Tests included: Silent reading comprehension, work-study skills, 
                   basic language skills, basic arithmetic skills. 

Iowa Tests of Educational Development 
(Science Research Associates) 
   Grade levels:   8.5-13.5 
   Tests included: Understanding of basic social concepts; general 
                   background in the natural sciences; correctness 
                   and appropriateness of expression; ability to do 
                   quantitative thinking; interpretation of reading 
                   materials in the social studies; interpretation 
                   of reading materials in the natural sciences; 
                   interpretation of literary materials; general 
                   vocabulary; using sources of information. 

Metropolitan Achievement Tests 
(World Book Company) 
   Grade levels:   Primary I (1); Primary II (2); Elementary (3-4); 
                   Intermediate (5-7.5); Advanced (7-9.5) 
   Tests included: Reading, vocabulary, and arithmetic included at 
                   all levels. Spelling for grade 2 and above. 
                   Language added for grade 3 and above. Geography, 
                   history, and science added for grade 5 and above. 

Sequential Tests of Educational Progress 
(Cooperative Test Division, Educational Testing Service) 
   Grade levels:   Level 4 (4-6); Level 3 (7-9); Level 2 (10-12); 
                   Level 1 (13-14) 
   Tests included: Essay writing, listening comprehension, reading 
                   comprehension, writing, science, mathematics, 
                   social studies. 

SRA Achievement Series 
(Science Research Associates) 
   Grade levels:   Grades 2-4; Grades 4-6; Grades 6-9 
   Tests included: Reading, arithmetic, and language arts included 
                   at all levels. Language perception added to 
                   grades 2-4. Work-study skills added for grades 
                   4-6 and 6-9. 

Stanford Achievement Tests 
(World Book Company) 
   Grade levels:   Primary (1.9-3.5); Elementary (3.0-4.9); 
                   Intermediate (5-6); Advanced (7-9) 
   Tests included: Reading, spelling, and arithmetic included at all 
                   levels. Language added for elementary battery and 
                   above. Social studies, science, and study skills 
                   added in the intermediate and advanced batteries. 

If the content of an achievement test is useful for measuring local 
objectives, but the norms furnished with the test are not appro- 
priate, local norms may be developed. These are norms based on 
the test scores of pupils in one particular school or school system. 
These local norms provide a means of comparing the achievement 
of pupils within the local system with each other as well as compar- 
ing new students coming into the school with those already there. 


Test profiles 


A test profile is a graphical representation of the scores on various 
tests given to an individual. By graphically representing these 
scores, it is possible to identify strengths and weaknesses of the 
individual student. That is, inspection of the profile may reveal 
the fact that an individual is strong in reading but weak in arith- 
metic. Before test profiles can be plotted, it is necessary for the 
norms for the various tests to be comparable. Unless the norm 
populations are comparable, preferably based on the same popu- 
lation, it is not valid to compare an individual’s standing on one 
test with that on a different test. The meaning of “stands at the 
70th percentile in reading” is quite different when the norm group 
consists of college preparatory high school students in the eleventh 
grade than when it consists of all high school students in the 
eleventh grade. 

Profiles have traditionally been plotted by depicting each of the 
separate test scores as a point. This procedure overemphasizes small 
differences between achievement in the different areas. A better 
procedure is to represent each of the tests by a shaded band 
extending one standard error of measurement² above and below the 
obtained score. This band serves as a reminder that test scores 
are not precise and prevents the overemphasis of small differences 
in achievement which may have no educational significance. The 
tests profiled on page 81³ illustrate this procedure. 

[Figure, page 81: score profile form with blanks for “Scores of” 
and “Sex”; each test score is plotted as a shaded band.] 

Interpretation: Scores profiled here are bands rather than points. 
The midpoint of each band shows approximately what percentage of 
students in the norming group earned scores lower than the one 
profiled. Each band covers two standard errors of measurement, one 
above and one below the percentile rank score earned. This means 
that the chances are 2-to-1 that the student's “true” score lies 
within the range of the band. 

If the bands of the student's verbal and quantitative scores 
overlap, there is probably no important difference between the 
scores. If the two bands do not overlap, the chances are about 
5-to-1 that there is a real difference in measured ability present. 
(See Manual for additional information on interpretation.) 
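
The band procedure can be sketched numerically. In the example below the obtained scores and the standard error of measurement of 4 points are hypothetical, and the function names `score_band` and `bands_overlap` are invented; a real standard error would be taken from the test manual.

```python
def score_band(obtained, sem):
    """Band extending one standard error below and above the score."""
    return (obtained - sem, obtained + sem)


def bands_overlap(first, second):
    """True when two (low, high) bands share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]


# Hypothetical obtained scores, each with a standard error of 4.
reading = score_band(62, 4)
arithmetic = score_band(57, 4)
science = score_band(48, 4)

print("reading", reading, "arithmetic", arithmetic, "science", science)
print("reading vs arithmetic overlap:", bands_overlap(reading, arithmetic))
print("reading vs science overlap:", bands_overlap(reading, science))
```

Under these assumed figures the five-point gap between reading and arithmetic would not be treated as an important difference, while the fourteen-point gap between reading and science probably would be.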


EXERCISES 


1. Inspect the items on an achievement test in an area of your 
   choice, and attempt to list the general objectives of the 
   test-maker. 

2. Outline the objectives of some course you are teaching or expect 
   to teach in the future. Obtain a standardized achievement test 
   whose title would lead you to think it might be useful in 
   evaluating these objectives. Compare your objectives with the 
   content of the test. 

3. In the manual of the test you used in Exercise 2, turn to the 
   section which describes the sample on which the norms were 
   based. Is there sufficient description of the sample to enable you 
   to decide whether this is the type of group with which you want 
   to compare your students? 

4. Compare two well-known standardized achievement tests in your 
   area to see how they differ. Identify, if you can, apparent 
   differences the test-makers had in mind when they wrote the test. 

5. Interview a counselor in a school which uses a test such as the 
   Iowa Tests of Educational Development to determine such things 
   as how often it is given, who takes the test, how the results are 
   used, and how the school is organized to administer the test. 

6. How is it possible for a child in the fifth grade to place at the 
   eighth-grade level on a standardized arithmetic test, without 
   being able to work all the types of arithmetic problems taught 
   at the eighth-grade level? 


² For an explanation of the standard error of measurement see 
Appendix B, page 109. 

³ Reproduced from Examiner's Manual for Cooperative School and College 
Ability Tests by permission of the Cooperative Test Division, Educational 
Testing Service, Princeton, New Jersey, and Los Angeles, Calif. 
CHAPTER 8 


Evaluating Abilities with 
Standardized Tests 


Closely related to achievement tests is a group of tests known 
as intelligence tests or tests of scholastic aptitude. These tests are 
designed to measure the pupil’s capacity to profit from school 
instruction. 

Achievement, intelligence, and aptitude tests are closely related 
to each other. Although the three categories traditionally 
have been thought of as separate and distinct, the modern view is 
to minimize differences between the labels of tests in any of these 
areas, and to select and use the tests on the basis of their validity 
for the particular job to be done. In a recent article Alexander G. 
Wesman¹ points out the overlap between these three "types" of tests. 

"By definition, an achievement test measures what the examinee 
has learned. But an intelligence test measures what the 
examinee has learned. And an aptitude test measures what the 
examinee has learned. So far, no difference is revealed. Yet three 
of the traditional categories into which tests are classified are 
intelligence, aptitude, and achievement. Now these categories are 
very handy; they permit publishers to divide their catalogs into 
logical segments, and provide textbook authors with convenient 
chapter headings. Unfortunately, the categories represent so 
much oversimplification as to cause confusion as to what is being 
measured. What all three kinds of tests measure is what the 
subject has learned. The ability to answer a proverbs item is no 
more a part of the examinee’s heredity than is the ability to 
respond to an item in a mechanical comprehension test or in a 
social studies test. All are learned behavior. 

Moreover, all are intelligent behavior. It takes intelligence to 
supply the missing number in a number series problem. It also 
requires intelligence to figure out which pulley will be most 
efficient, or to remember which president proposed an 
inter-American doctrine. We can say, then, that an intelligence test 
measures intelligent behavior, an aptitude test measures 
intelligent behavior and an achievement test measures intelligent 
behavior. 

Finally, all three types of tests measure probability of future 
learning or performance, which is what we generally mean when 
we speak of "aptitude." In business and industry, the chances 
that an employee will profit from training or will perform new 
duties capably may be predicted by scores on an intelligence 
test, by scores on one or more specific aptitude tests, or by some 
measure of the degree of skill the employee already possesses. 
Similarly, test users in the schools know that an intelligence 
test is usually a good instrument for predicting English grades, 
a social studies test is often helpful for prediction of future 
grades in social studies, and a mechanical comprehension test is 
likely to be useful in predicting for scientific or technical courses. 
So, intelligence tests are aptitude tests, achievement tests are 
aptitude tests, and aptitude tests are aptitude tests." 

¹ Test Service Bulletin #51 (New York: The Psychological Corporation, 
December, 1956). 


What intelligence tests measure 


Intelligence tests evaluate primarily an individual's ability to 
learn the materials taught in school. This was their original 
purpose and continues to be their main function, although they are 
also used for other purposes. It is widely thought that intelligence, 
as measured by intelligence tests, does not depend on the extent of a 
person's schooling or other experiences. This is a complete 
misconception. Thus, identical twins exposed to completely different 
environments, one rich in opportunities for intellectual stimulation, 
and the other devoid of such opportunities, can be expected to vary 
considerably in their scores on an intelligence test. The one with 
the richer environment will test significantly higher on a typical 
intelligence test than his less fortunate twin. This fact must always 
be kept in mind in interpreting the results of intelligence test scores. 

Intelligence tests are designed to predict the ability of pupils to 
profit from school instruction, and for this purpose they are fairly 
competent, if school instruction is thought of as limited to the 
academic areas. However, they are not good predictors of how well 
a student may perform in non-academic areas such as woodworking, 
art, and music. Neither are they an indication of a person’s 
over-all worth. Work habits, study habits, perseverance, integrity, and 
many other important facets of the individual's personality are not 
revealed by the I.Q. As a result, it is not uncommon to find a pupil 
achieving more than his classmate, even though his I.Q. is not as 
high as his classmate's. The intelligence test indicates how well the 
child may achieve in school; it does not guarantee that performance. 


Types of intelligence tests 


The earliest intelligence tests were devised to be administered to 
one person at a time. This type of test is called an individual 
intelligence test. After intelligence tests came to be generally 
accepted, group tests of intelligence were devised, which resemble 
the standardized achievement test and can be administered to more 
than one pupil at a time. Because of ease of administration and 
the lower cost, most of the intelligence tests given in our schools 
are group tests rather than individual intelligence tests. 

Group intelligence tests depend on reading ability; accordingly, 
pupils who are poor readers will obtain a lower score on a group 
intelligence test than other pupils of equal inherited ability who are 
better readers. There are, however, tests of intelligence that are 
completely non-verbal. These tests consist of performance items and 
are called performance or non-language tests of intelligence. This type 
of test is not as good an indicator of the pupil's ability to do school 
work as the tests of intelligence which involve the use of language. 
However, they provide a useful means of obtaining evidence about 
the intelligence of pupils who have a language problem, either 
because they are poor readers or because English is not their 
native tongue. 

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Since group intelligence tests depend heavily upon language, 
whenever a pupil scores very low on a group intelligence test fur- 
ther evidence should be obtained to shed light upon the question 
of whether the low score really reflects a lack of basic ability to do 
school work, or whether some other factor was operating to lower 
the score. This further evidence can consist of an individual intel- 
ligence test, a non-language test of intelligence, or the pupil's actual 
performance in the classroom. 

Some intelligence tests yield more than one I.Q.—for example, a 
verbal I.Q. and a non-verbal I.Q. When such a test is used, large 
differences between the two I.Q.'s can reveal the existence of a 
language handicap. 


Numerical expression of intelligence 


The exact meaning of a numerical score which represents the 
I.Q. depends upon the particular test which was used. The term 
I.Q. was first used in connection with the Stanford-Binet test of 
intelligence. It was defined as 

$$\text{I.Q.} = \frac{\text{Mental Age}}{\text{Chronological Age}} \times 100.$$

Thus, if a pupil had a mental age of 120 months, as measured by the 
test, and a chronological age of 80 months, his I.Q. would be 
$\frac{120}{80} \times 100$, which equals 150. Whenever the mental 
age is exactly equal to the chronological age, the I.Q. is 100. 
Because the term “I.Q.” was so well known, it was borrowed by other 
test-makers to use in reporting scores on their intelligence tests. 
However, the I.Q.’s obtained from many of these other tests are not 
derived by dividing a mental age by a chronological age. 

In spite of the fact that numerical scores derived from different 
tests of intelligence are not exactly comparable, they all have a 
certain similarity. This fact enables the test-user to arrive at a crude 
estimate of a pupil’s ability to do school work, even though the 
exact meaning of the test score may not be known. On all tests 
which express intelligence in terms of an I.Q., the average score 
is approximately 100. However, since the scores are crude measures, 
a better way of thinking of the average is to consider any score in 
the range from approximately 90 to 110 to be average. No emphasis 
should be placed on the exact score earned by an individual, rather 
the score should be thought of as reflecting a general level of 
capacity to profit from school instruction. 
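
The ratio definition of the I.Q. and the broad "average" band described in this section can be tried out with a short Python sketch. It is an illustration only: the function names `ratio_iq` and `general_level` are hypothetical, the labels "below average" and "above average" are assumptions added for the example, and the calculation applies only to tests that derive the I.Q. from a mental age and a chronological age.

```python
def ratio_iq(mental_age_months: float, chronological_age_months: float) -> float:
    """Ratio I.Q.: mental age divided by chronological age, times 100."""
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    return mental_age_months / chronological_age_months * 100


def general_level(iq: float, low: float = 90, high: float = 110) -> str:
    """Place a score in a broad band instead of stressing the exact number.

    The 90-110 band follows the text; the two outer labels are assumed.
    """
    if iq < low:
        return "below average"
    if iq > high:
        return "above average"
    return "average"


# The worked example from the text: mental age 120 months, age 80 months.
iq = ratio_iq(120, 80)
print(iq, general_level(iq))  # 150.0 above average
```

Reporting the band rather than the bare number reflects the advice that no emphasis should be placed on the exact score.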


Homogeneous grouping 


Intelligence test results have sometimes been used to create classes 
whose members are similar in intelligence. For example, suppose 
that there were five fourth-grade classes in an elementary school. 
Strict homogeneous grouping by intelligence would require deter- 
mining the intelligence of each fourth-grade pupil and dividing 
the pupils into five groups, assigning the top fifth of the pupils to 
one class, the next fifth to another class, and so forth. 

The purpose of grouping pupils in this fashion is to reduce the 
variability of ability in each class, in order to make instruc- 
tion and learning both easier and more effective. 
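
The procedure of strict homogeneous grouping described above can be sketched in Python. This is a minimal illustration under stated assumptions: the function name `group_by_iq`, the pupil labels, and the scores are all hypothetical, and when the number of pupils does not divide evenly the extra pupils are placed in the higher groups, which is only one of several reasonable choices.

```python
def group_by_iq(scores: dict[str, float], n_classes: int = 5) -> list[list[str]]:
    """Rank pupils from highest to lowest score, then cut the ranking
    into n_classes groups of nearly equal size (top group first)."""
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    base, extra = divmod(len(ranked), n_classes)
    groups, start = [], 0
    for i in range(n_classes):
        size = base + (1 if i < extra else 0)  # spread any remainder over the first groups
        groups.append(ranked[start:start + size])
        start += size
    return groups


# Hypothetical scores for ten pupils, P0 (lowest) through P9 (highest).
scores = {f"P{i}": 80 + 5 * i for i in range(10)}
for number, members in enumerate(group_by_iq(scores), start=1):
    print(number, members)
```

Each resulting class covers a narrower range of scores than the grade as a whole, which is the reduction in variability the grouping aims at.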


Using intelligence test scores 


The intelligence test provides the teach 
mining the over-all ability of a pupil 


I.Q. may not be as outstanding in art, music, physical education, 
work involving manual dexterity, or other areas of school work 
which are not highly academic in nature, as he is in the more 
academic subjects. 

4. The results of intelligence tests given more than two or three 
years previously should not be relied on too heavily. Since meas- 
ured I.Q. depends on the pupil’s experiences, there may be a 
considerable shift in measured I.Q. in that time. 

5. Personality and character are not completely reflected by the 
I.Q. There are many worth-while persons in our society who have 
relatively low I.Q.’s, and many persons in prisons who have high 
I.Q.’s. 

6. Two persons with the same I.Q. may have entirely different 
strengths and weaknesses, even in intellectual areas. I.Q. tests 
usually include verbal items (such as vocabulary) and quantita- 
tive items (such as arithmetic). It is possible for two individuals 
to achieve the same total I.Q. score with one individual much 
stronger in the verbal area than in the quantitative area, and the 
other individual much stronger in quantitative than in verbal. 

7. If other available evidence is not consistent with a pupil's 
score on an intelligence test, investigate to discover the reason for 
the inconsistency. It may be that the I.Q. test given to the child 
was, for some reason, not valid. 


Tests of special aptitudes 


Although the I.Q. is a measure of the general aptitude of the 
student to profit from school instruction, some areas of school 
instruction are more closely related to the I.Q. than are others. 
This fact has led to the development of tests designed to predict 
success in some particular aspect of the curriculum with more 
validity than that achieved by using the general intelligence test. 
Special aptitude tests have been developed for art, music, algebra, 
foreign language, shorthand, reading, and other school subjects. 
The most widely used prognostic tests are those of "reading readi- 
ness." These tests are designed to be given to students to determine 
whether they are sufficiently mature to profit from formal instruc- 
tion in reading. 

In recent years a number of aptitude test batteries have appeared 
on the market. The batteries consist of a number of tests of differ- 
ent aptitudes designed to provide a profile of the pupil’s strengths 
and weaknesses within the intellectual domain. Their advan- 
tage is that they provide norms for the separate tests in the battery 
based on the same population. This permits the comparison of the 
relative standing of an individual in the various areas covered by 
the tests. Such comparisons cannot be made with tests whose norms 
have been developed on different populations. The major problem 
is that of determining the validity of each separate score or com- 
bination of scores for each purpose for which the battery is to 
be used. 

Persons contemplating the use of special aptitude tests or aptitude 
batteries should carefully examine the evidence about their validity 
to determine whether the test or battery is actually an improvement 
over a measure of general intelligence. 


Table 7 Representative Standardized Tests of Ability 

| Test and Publisher | Grade or Age Levels | Type | Scores Reported |
|---|---|---|---|
| California Test of Mental Maturity, 1957 Edition (California Test Bureau) | Grades: Kgn.-1; 1-3; 4-8; 7-9; 9-13; 10-College; Adult | Group | Language I.Q., non-language I.Q., total I.Q. |
| Chicago Tests of Primary Mental Abilities (Science Research Associates) | Ages 11-17 | Group | Number, verbal meaning, space, word fluency, reasoning, memory |
| Differential Aptitude Test Battery (Psychological Corporation) | Grades 8-12 | Group | Verbal reasoning, numerical ability, abstract reasoning, space relations, mechanical reasoning, clerical speed and accuracy, language usage |
| Kuhlman-Anderson Intelligence Test—Sixth Edition (Personnel Press, Inc.) | Grades: Kgn., 1, 2, 3, 4, 5, 6, 7-8, 9-12 | Group | Total I.Q. |
| Otis Quick-Scoring Mental Ability Tests: New Edition (World Book Company) | Grades 1-4; 4-9; High School and College | Group | Total I.Q. |
| Pintner General Ability Tests—Non-Language Series (World Book Company) | Grades 4-9 | Group | Non-language I.Q. |
| Pintner General Ability Tests—Verbal Series (World Book Company) | Grades: Kgn.-2; 2.5-4.5; 4.5-9.5; 9 and above | Group | Total I.Q. |
| Revised Stanford-Binet Scale—1937 Edition (Houghton-Mifflin) | Ages 2-adult | Individual | Total I.Q. |
| School and College Ability Tests (Cooperative Test Division, Educational Testing Service) | Grades 4-6; 6-8; 8-10; 10-12; 13-14 | Group | Verbal, quantitative, total |
| Terman-McNemar Test of Mental Ability (World Book Company) | Grades 7-12 | Group | Total I.Q. |
| Wechsler Adult Intelligence Scale (Psychological Corporation) | Ages 16 and up | Individual | Verbal I.Q., performance I.Q., total I.Q. |
| Wechsler Intelligence Scale for Children (Psychological Corporation) | Ages 5-15 | Individual | Verbal I.Q., performance I.Q., total I.Q. |


EXERCISES 


1. List several reasons why a pupil might be achieving higher than 
his I.Q. would indicate. 

2. Obtain a group test of intelligence. Examine the items in the 
test to determine what percent of them can be classified as 
achievement items. 

methods since anyone with an I.Q. of 115 must be able to learn 
Latin. Do you think the counselor is right? Why? 

4. If a teacher wanted to group her class for instruction in arith- 
metic, which of the following instruments would be most likely 
to produce groups alike in their present standing in arithmetic: 
an individual intelligence test; a group non-verbal intelligence 
test; a group arithmetic achievement test? Why? 

5. When Mary was 7 years old her I.Q. was 102; when she was 10 
years old her I.Q. was 89 and when she was 13 years old her I.Q. 
was 110. How can the differences between these three I.Q.’s be 
accounted for? What I.Q. might be obtained if Mary were to 
be tested when she was 15 years old? 

6. Obtain from school records the I.Q.’s of several pupils who have 
been given two or more tests of intelligence. Compare the differ- 
ent I.Q.’s for each pupil to determine how constant their I.Q.’s 
were as measured by intelligence tests. 


SUGGESTED ADDITIONAL READINGS 
Anastasi, A. Psychological Testing. 

Cronbach, L.J. Essentials of Psychological Testing. 

Freeman, F.S. Theory and Practice of Psychological Testing. 

Goodenough, F.L. Mental Testing. 

Thorndike, R.L., and Hagen, E. Measurement and Evaluation in 
Psychology and Education. 


CHAPTER 9 


Evaluating Interest and Adjustment 
with Standardized Instruments 


In working with pupils it is often helpful to know something 
about their interests and adjustment. Observations and inter- 
views, both formal and informal, are the traditional techniques for 
collecting this type of data. As an adjunct to observation and inter- 
view, paper-and-pencil instruments have been developed to obtain 
information about the pupil's interests and adjustment in a more 
economical way. These paper-and-pencil instruments consist of 
standard sets of questions administered and scored in a standardized 
way. The fact that they resemble standardized tests so closely has 
led some persons to mistakenly classify them as standardized tests, 
whereas the correct name for instruments of this type is question- 
naire or inventory. A test consists of questions which have right or 
wrong answers. Inventories consist of questions which have no right 
answers, but which merely report the student's feelings, preferences, 
or actions. 

The validity of all standardized inventories or questionnaires 
depends on the pupil's willingness to reveal himself frankly and his 
ability to see himself as he is. For this reason self-report inventories 
have their greatest use in situations where the individual has no 
reason to fake his answers, since all self-report inventories are 
fakable to some degree. When used by persons who are aware of 
their limitations, standardized inventories can provide useful facts 
about pupils. 


Interest inventories 


Interest inventories are designed to investigate the pupil's inter- 
ests in and feeling toward occupations and professions and various 
activities. This evidence then becomes useful in vocational coun- 
seling. It can also be an aid in determining what kinds of class- 
room activities and educational experiences will interest the pupil. 
The major types of interest inventories on the market are illustrated 
by the Strong Vocational Interest Blank, the Kuder Preference 
Record (Vocational), and What I Like to Do. 

The Strong Vocational Interest Blank for Men (Stanford Uni- 
versity Press) is one of the best-known instruments for appraising 
interests. Consisting of over 400 items, the inventory can be scored 
for forty-seven occupations, and for six groups of occupations. A 
separate form of the Strong is available for women and includes 
scoring for twenty-eight occupations. The Strong is most useful 
when a pupil has expressed the desire to see whether his interests 
coincide with the interests of persons in one of the forty-seven 
occupations for which scoring keys are available. 

The Kuder Preference Record-Vocational (Science Research 
Associates) has several different forms. Form B measures nine 
interest areas: mechanical, computational, scientific, persuasive, 
artistic, literary, musical, social service, and clerical. Form C in- 
cludes all of these scales plus an outdoor scale and a verification 
score (used to identify those who have not followed directions or 
who have answered carelessly). These two forms are most useful 
in identifying general occupational areas for further study and 
exploration for pupils who have no definite occupational plans at 
the time they take the inventory. A new form of the Kuder (Form 
D), now available, measures the individual's interest in specific 
occupations, as does the Strong. At present scoring keys are avail- 
able for twelve specific occupations. 

Both the Strong and the Kuder are used with high school stu- 
dents, college students, and adults. 

A new interest inventory for grades 4 through 7 is the inventory 
What I Like To Do (Science Research Associates). This interest 
inventory measures interests in eight areas: art, music, social studies, 
active play, quiet play, manual arts, home arts, and science. The 
inventory is designed to help teachers in planning suitable instruc- 
tional activities and in guiding and counseling individual pupils. 

Interest should not be confused with ability. A pupil may be 
interested in an activity or an occupation but have little ability in 
that activity or occupation. Counselors or teachers should always 
consider both ability and interest when counseling pupils. 


Adjustment inventories 


Adjustment is defined as the degree of ability to fit into and live 
happily in one’s environment. Numerous commercial instruments 
attempt to measure adjustment. However, this type of inventory 
must be used very cautiously, and at best it gives only clues or 
indications of problem areas. 

Typical inventories in this area are the SRA Youth Inventory, 
The Bell Adjustment Inventory, and the Minnesota Multiphasic 
Personality Inventory (MMPI). 

The SRA Youth Inventory (Science Research Associates) helps 
identify problems that young people worry about most. The eight 
areas covered are: My School, Looking Ahead, About Myself, Get- 
ting Along with Others, My Home and Family, Boy Meets Girl, 
Health, and Things in General. This inventory is designed for 
grades 7 through 12. 

The Bell Adjustment Inventory (Stanford University Press) 
yields scores in four areas—home, health, social, and emotional. It 
also helps identify problems of concern to young people. 

The Minnesota Multiphasic Personality Inventory (Psychologi- 
cal Corporation) is designed to identify a number of distinct cate- 
gories of abnormal behavior. It has ten scales: Hypochondriasis, 
Depression, Hysteria, Psychopathic Deviate, Masculinity and Fem- 
ininity, Paranoia, Psychasthenia, Schizophrenia, Hypomania, and 
Social Introversion. The MMPI represents a type of inventory 
which should be used only by persons with advanced training in 
psychology. It is mentioned here merely to indicate that instru- 
ments of this type exist and to point out the fact that such instru- 
ments should not be employed by classroom teachers or others not 
especially qualified in their use. 


Summary evaluation of interest and adjustment inventories 


Self-report inventories of interest and adjustment are primarily 
instruments to be used in understanding and counseling students, 
although interest inventories do have some use in planning class- 
room activities. Both types of inventories are subject to the limi- 
tations that the pupil must (1) be willing to give truthful answers 
and (2) have self-insight and self-understanding. 

Interest inventories are widely used in secondary schools and 
colleges. When used together with measures of ability and other 
information about the pupil, they provide valuable material for 
use in vocational counseling. 

Adjustment inventories have not been adopted as widely as inter- 
est inventories, nor is there as much evidence about their validity. 
Some authorities in the field of measurement and evaluation go so 
far as to recommend that they not be used at all. While this recom- 
mendation may be somewhat too harsh, it does seem clear that 
instruments of this type cannot be used by persons without consid- 
erable special training, and probably should not be used by class- 
room teachers. 


EXERCISES 


1. Obtain a copy of the Strong Vocational Interest Blank and the 
Kuder Preference Record-Vocational, and compare the types of 
items. 

2. List and discuss some precautions that should be taken when 
selecting, administering, and interpreting the results of an inven- 
tory which is known to be “fakable.” 

3. Locate a research study which used an adjustment inventory. 
Note the way the instrument was employed and the precautions 
taken in interpreting the findings. 


SUGGESTED ADDITIONAL READINGS 


Ferguson, L.W. Personality Measurement. 

Thorndike, R.L., and Hagen, E. Measurement and Evaluation in 
Psychology and Education. 

Vernon, P.E. Personality Tests and Assessments. 


CHAPTER 10 


How to Select a Standardized 
Test or Inventory 


Basically the problem of buying a test is the same as that involved 
in buying any other item. The first step is to determine the func- 
tions or purposes to be served by the test or inventory to be pur- 
chased. Next, a number of tests or inventories which seem to be 
potentially useful for these purposes must be located. Finally, 
the one test or inventory which best seems to fit the purposes is 
selected. 


Determining the purposes of the test or inventory 


Unless the purposes to be served by a test or inventory can be 
stated in concrete terms, there is probably no point in its pur- 
chase. Therefore, the first step should be to write down the pur- 
pose or purposes to be served by the test or inventory. These 
should be as explicit as possible. If an achievement test is needed 
to evaluate learnings in some area, the specific learnings which the 
test is expected to cover should be listed. If an interest inventory 
is being considered for use in a guidance program, a list should be 
made of the exact ways in which the inventory is to be used. 


Locating likely tests or inventories 


After the objectives have been clearly determined, the next step 
is to locate tests which potentially are useful in fulfilling the objec- 
tives. The best single source of information about tests and inven- 
tories is the Fourth Mental Measurements Yearbook. This book 
contains descriptions of hundreds of published tests as well as 
critical reviews of many of them. No search for likely tests is com- 
plete until this book has been consulted. Since new tests continually 
become available and the Mental Measurements Yearbooks are pub- 
lished at irregular intervals, there may be some new tests or inven- 
tories which are not yet listed in this source. To locate these new 
tests a search should be made in the catalogs of test publishers. 
These catalogs are the most complete list of the available tests and 
inventories. A list of publishers who issue catalogs is included at 
the end of this chapter. 


Obtaining specimen copies of the most likely 
tests or inventories 


From the descriptions and reviews in the Mental Measurements 
Yearbook and from descriptions contained in the publishers' catalogs, 
the names of those tests or inventories which seem most likely to ful- 
fill the objectives are determined. Next, specimen sets of these tests 
or inventories are ordered from the test publishers. A specimen set 
consists of a copy of the test or inventory, an answer key, a manual, 
and all other material usually supplied with the test or inventory. 
It can be purchased for a nominal sum (usually 50 cents or less). 
Before ordering the specimen set, check the restrictions contained in 
the test catalog on who may order test materials to be sure that the 
order will be filled. Usually school administrators or college teach- 
ers can authorize the purchase of specimen sets. 


Evaluating the specimen copies 


The major consideration in selecting a test is its validity, the 
extent to which it measures what the user wants it to measure. In 
selecting an achievement test its validity can be determined by 
analyzing the test, item by item, to determine if the individual 
items measure the specified objectives. A simple method of doing 
this is to compile a list of objectives and classify each item in the 
test according to which objective or objectives it measures. In set- 
ting up the list of objectives include a classification of “Irrelevant,” 
since some of the items in the test may not measure any of the 
objectives and should be considered as irrelevant to the stated objec- 
tives. The test which best covers the objectives and has the smallest 
percentage of irrelevant items is the best test for the purpose. The 
procedure used in selecting an intelligence test, an interest inven- 
tory or a measure of typical behavior is somewhat different. In these 
cases the validity of the test must be determined through examina- 
tion of the empirical (statistical) evidence furnished in the test 
manual. In any case, it is advisable for the prospective user to 
actually take each test under consideration. This procedure is a 
valuable aid in becoming acquainted with a new test. 

If the search reveals several different tests which seem to be 
equally valid for the intended purpose, the choice may be made 
between them on the basis of reliability, cost, time for administra- 
tion, ease of scoring, format, or other secondary considerations 
which have a bearing on the selection. 

1 Buros, O. K. (Editor). The Fourth Mental Measurements Yearbook. 
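
The item-by-item classification described in this section lends itself to a simple tally. The Python sketch below is illustrative only: the function name `summarize_coverage`, the sample objectives, and the sample item classifications are hypothetical, and an item classified under no listed objective is counted as "Irrelevant."

```python
from collections import Counter


def summarize_coverage(item_objectives: dict[int, list[str]], objectives: list[str]):
    """Tally how many items measure each objective.

    item_objectives maps an item number to the objectives the reviewer
    judged it to measure; an empty list means the item is irrelevant.
    Returns (items per objective, percent irrelevant, objectives not covered).
    """
    counts = Counter()
    irrelevant = 0
    for measured in item_objectives.values():
        relevant = [obj for obj in measured if obj in objectives]
        if not relevant:
            irrelevant += 1
        counts.update(relevant)
    total = len(item_objectives)
    pct_irrelevant = 100 * irrelevant / total if total else 0.0
    uncovered = [obj for obj in objectives if counts[obj] == 0]
    return dict(counts), pct_irrelevant, uncovered


# Hypothetical classification of a four-item arithmetic test.
objectives = ["fractions", "decimals", "percent"]
items = {1: ["fractions"], 2: ["fractions", "decimals"], 3: [], 4: ["decimals"]}
print(summarize_coverage(items, objectives))
```

Running the same tally on each specimen test gives a direct basis for comparison: the preferred test leaves the fewest objectives uncovered and has the smallest percentage of irrelevant items.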


Free advice on test selection 


American test publishers who issue catalogs 


Acorn Publishing Co., Inc., Rockville Centre, New York. 

Bureau of Educational Measurements, Kansas State Teachers Col- 
lege of Emporia, Emporia, Kansas. 

Bureau of Educational Research and Service, State University of 
Iowa, Iowa City, Iowa. 

Bureau of Publications, Teachers College, Columbia University, 
New York 27, New York. 

California Test Bureau, 5916 Hollywood Boulevard, Los Angeles 
28, California. 

Center for Psychological Service, George Washington University, 
Washington, D. C. 

Cooperative Test Division, Educational Testing Service, 20 Nassau 
St., Princeton, New Jersey; 4640 Hollywood Boulevard, Los 
Angeles 27, California. 

Educational Test Bureau, Educational Publishers, Inc., 720 Wash- 
ington Avenue, S.E., Minneapolis 14, Minnesota. 

C. A. Gregory Co., 345 Calhoun St., Cincinnati 19, Ohio. 

Houghton Mifflin Co., 2 Park St., Boston 7, Massachusetts. 

Ohio Scholarship Tests, Ohio State Department of Education, 
Columbus, Ohio. 

Personnel Press, Inc., 180 Nassau St., Princeton, New Jersey. 

Psychological Corporation, 552 Fifth Avenue, New York 36, New 
York. 

Public School Publishing Co., 204 West Mulberry St., Bloomington, 
Illinois. 

Science Research Associates, Inc., 57 West Grand Ave., Chicago 10, 
Illinois. 

Sheridan Supply Co., P. O. Box 837, Beverly Hills, California. 

Stanford University Press, Stanford, California. 

C. H. Stoelting Co., 424 No. Homan Ave., Chicago 24, Illinois. 

World Book Co., 313 Park Hill Ave., Yonkers 5, New York. 


EXERCISES 


1. Send for the test catalogs of the following publishers. These are 
some of the most important publishers in terms of the number 
of tests which they publish. 

California Test Bureau 
Cooperative Test Division, Educational Testing Service 
Educational Test Bureau 
Psychological Corporation 
Science Research Associates 
World Book Company 

Study the catalogs and note the variety of standardized tests 
and inventories that are available. 

2. Select two or three standardized achievement tests in your own 
field from the Fourth Mental Measurements Yearbook or from 
the publishers’ catalogs and send for a specimen set of each test. 
Be sure to have your order countersigned by your college instruc- 
tor or the test publisher probably will not send the test. It is 
a good idea to enclose the money with your order since the 
amount involved is usually small, and it is an imposition to 
make the company send a bill for such a small amount. 

3. When you receive your specimen copies of the tests, take the 
tests yourself under the standardized conditions described in 
the manual. Carefully study both the tests and the manuals. 
If the tests have been reviewed in the Fourth Mental Measure- 
ments Yearbook read the reviews and see whether you would 
agree with the reviewers’ opinions about the tests. 


SUGGESTED ADDITIONAL READINGS 


Greene, H.A., Jorgensen, A.N., and Gerberich, J.R. Measurement 
and Evaluation in the Secondary School. 

Thorndike, R.L., and Hagen, E. Measurement and Evaluation in 
Psychology and Education. 


APPENDIX A 


An Annotated Bibliography on 
Measurement and Evaluation 


Adams, G.S., and Torgerson, T.L. Measurement and Evaluation for 
the Secondary School Teacher. New York: Dryden Press, 1956. 
Covers use of teacher-made and standardized measuring devices. 
Includes implications for corrective procedures. Separate chap- 
ters covering measurement and evaluation in the major subjects 
taught in the secondary schools. 


Adkins, D.C. Construction and Analysis of Achievement Tests. 
Washington, D.C.: U.S. Government Printing Office, 1947. 
A sound treatment of objective test construction. Written for 
the United States Civil Service Commission. 


Anastasi, A. Psychological Testing. New York: The Macmillan 
Company, 1954. 
Standard college text in psychological testing. Includes basic 
principles of test and measurement theory and discussions of 
representative standardized tests. 


Arny, C.B. Evaluation in Home Economics. New York: Appleton- 
Century-Crofts, Inc., 1953. 
Methods of measuring and evaluating student progress with 
special emphasis on and examples from the field of home 
economics. 


Bean, K.L. Construction of Educational and Personnel Tests. New 
York: McGraw-Hill Book Co., 1953. 
Emphasizes the development of better tests. Includes sug- 
gestions for preparing objective test items and essay questions. 


Bloom, B.S. A Taxonomy of Educational Objectives. New York: 
Longmans, Green & Co., 1956. 
A scholarly classification of educational objectives, together 
with illustrations of how the achievement of these objectives 
may be measured. 


Buros, O.K. (Editor). The Fourth Mental Measurements Yearbook. 
Highland Park, New Jersey: Gryphon Press, 1953. 
Lists and reviews the most commonly used standardized tests 
and inventories. The most important source of information 
about standardized tests and inventories. 


Clarke, H.H. Application of Measurement to Health and Physical 
Education. New York: Prentice-Hall, Inc., 1950. 
Covers methods of measuring and evaluating performance and 
knowledge in the field of health and physical education. 


Cronbach, L.J. Essentials of Psychological Testing. New York: 
Harper & Brothers, 1949. 


College text on psychological testing. Sound presentation of 
basic measurement theory. Covers basic concepts, tests of 
ability, and testing of typical performance. 


Ferguson, L.W. Personality Measurement. New York: McGraw-Hill 
Book Co., 1952. 


Discusses methods and representative tests and devices used in 
evaluating personality. 


Freeman, F.S. Theory and Practice of Psychological Testing (Re- 
vised). New York: Henry Holt and Co., 1955. 
College text on psychological testing. Covers theory and appli- 
cations. Includes coverage of tests of ability and personality. 
Especially good coverage of intelligence tests. 


Garrett, H.E. Statistics in Psychology and Education (Fourth Edi- 
tion). New York: Longmans, Green & Co., 1958. 
Standard text on introductory statistics as applied to psychology 
and education. 


Gerberich, J.R. Specimen Objective Test Items. New York: Long- 
mans, Green & Co., 1956. 
Primarily a collection of over 227 objective test items represent- 
ing a wide variety of subjects and objectives. Includes specimen 
objective test items for evaluating skills, knowledges, concepts, 
understandings, applications, activities, appreciation, attitudes, 
interests, and adjustment. A rich source of ideas for the con- 
struction of objective test items. 


Goodenough, F.L. Mental Testing. New York: Rinehart & Co., 1949. 
Includes sections on historical orientation, principles and 
methods, tests and scales, and applications. Emphasizes the im- 
portance of research in measurement. 


Greene, H.A., Jorgensen, A.N., and Gerberich, J.R. Measurement 
and Evaluation in the Elementary School (Second Edition). 
New York: Longmans, Green & Co., 1953. 
College text on measurement and evaluation. Covers teacher- 
made and standardized measures. Special chapters devoted to 
the evaluation of achievement in different areas of the elemen- 
tary school curriculum. 


——— Measurement and Evaluation in the Secondary School (Sec- 
ond Edition). New York: Longmans, Green & Co., 1954. 
Secondary school version of the book listed above. Has much 
material that is also contained in the Elementary volume. In- 
cludes separate chapters on each major separate subject taught 
in the secondary schools. 


Henry, N.B. (Editor). Measurement of Understanding. Forty-fifth 
Yearbook, Part I, National Society for the Study of Education. 
Chicago: The University of Chicago Press, 1946. 
Emphasizes the nature of understanding and techniques for 
evaluating understanding. Includes sample test items for meas- 
uring understanding in many different subject areas. 


Lindquist, E.F. A First Course in Statistics (Revised Edition). 
Boston: Houghton Mifflin Co., 1942. 
An introductory text in educational statistics. 


——— (Editor). Educational Measurement. Washington, D.C.: 
American Council on Education, 1951. 
The most comprehensive and authoritative book on achieve- 
ment test construction available. Chapters written by 20 
authorities in measurement and evaluation. Includes sections 
on the functions of measurement in education, the construction 
of achievement tests, and measurement theory. 


Magnuson, H.W., et al. Evaluating Pupil Progress. Bulletin of the
California State Department of Education, XXI:6, 1952. 
Emphasizes the informal measurement and evaluation tech- 
niques that can be used by the classroom teacher. 


Micheels, W.J., and Karnes, M.R. Measuring Educational Achievement.
New York: McGraw-Hill Book Co., 1950.
Covers standardized and teacher-made measures of ability and
personality. Of particular value in coverage of product and
procedures measurement. Good coverage of measurement and
evaluation in industrial arts.


Monroe, W.S. (Editor). Encyclopedia of Educational Research
(Revised Edition). New York: The Macmillan Company, 1950.
Contains authoritative articles on many aspects of measurement
and evaluation.


Odell, C.W. How to Improve Classroom Testing. Dubuque, Iowa: 
William C. Brown Co., 1953. 
A relatively non-technical book on the construction and use of 
teacher-made tests. Covers objective and essay tests. 


Remmers, H.H., and Gage, N.L. Educational Measurement and
Travers, R.M.W. Educational Measurement. New York: The Mac
book listed below.


——— How to Make Achievement Tests.


1950.


APPENDIX B


More About Validity and Reliability


Test validity 
A measuring device is valid if it measures what it is supposed to
measure. Validity is sometimes referred to as the truthfulness of
a measure. It is the most important characteristic of a test. If a
test does not truly measure what it is supposed to measure, it is of
no value, regardless of its other good features. Sometimes a test is
referred to as a “valid test.” In fact a test cannot be said to be valid
in a general sense. It can only be valid for a particular purpose or
purposes. A test which is a valid measure of achievement in social
studies in the fifth grade in one community may not be a valid test
of social studies in the fifth grade in an adjoining community.
There are two main types of validity. The first type is logical
or rational validity. This type of validity is established by
inspecting the test itself to determine the extent to which the items
in the test correspond to the objectives in the course or unit that is
being evaluated. The second type is statistical or empirical. This
type of validity is employed when a measuring device is being used to
predict some kind of behavior. The measure of validity used here
is generally a correlation coefficient between the test scores and the
scores obtained from the behavior which we are interested in
predicting. This correlation coefficient is called a validity coefficient.
Validity coefficients theoretically can take on the whole range of
values from 1.00 (perfect positive correlation) through .00 (zero
correlation) to -1.00 (perfect negative correlation). Validity
coefficients of more than .50 are rare when a test is correlated with
some criterion other than another test. Even a correlation of this
size leaves a great deal to be desired when a test is used to predict
what a person may do at some future time. For this reason test scores
should always be interpreted as clues which may help predict future
behavior rather than absolute predictors of behavior.
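
As an illustration of how a validity coefficient might be computed, the sketch below correlates test scores with a criterion measure using the ordinary product-moment (Pearson) formula. The function name `pearson_r` and both lists of numbers are hypothetical and are not taken from any real test; they only show the arithmetic.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    sum_xy = sum(a * b for a, b in zip(dev_x, dev_y))
    sum_xx = sum(a * a for a in dev_x)
    sum_yy = sum(b * b for b in dev_y)
    if sum_xx == 0 or sum_yy == 0:
        raise ValueError("correlation is undefined when one list has no spread")
    return sum_xy / sqrt(sum_xx * sum_yy)


# Hypothetical data: aptitude test scores and a later criterion
# (for example, grade averages) for the same eight pupils.
test_scores = [52, 61, 47, 70, 58, 65, 43, 75]
criterion = [2.1, 2.8, 2.6, 3.0, 2.2, 3.4, 2.4, 2.9]

validity_coefficient = pearson_r(test_scores, criterion)
print(f"validity coefficient: {validity_coefficient:.2f}")
```

The result always falls between -1.00 and 1.00, the range described above.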

Most tests used by classroom teachers are achievement tests. Tests
of this type rely primarily on logical rather than empirical methods
of determining the validity of the test for a particular purpose. In
evaluating the validity of an achievement test, the proper procedure
is to make an outline of the objectives and content of the course or
unit which is being tested, and then compare these objectives and
content with the test in question.
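
The comparison of a course outline with a test can be tallied in a simple way, as sketched below. The objectives, the intended emphasis given to each, and the assignment of items to objectives are all hypothetical; in practice they would come from the teacher's own outline and from a judgment about what each item measures.

```python
from collections import Counter

# Hypothetical outline: objective -> intended share of the test.
outline = {
    "map reading": 0.25,
    "local government": 0.25,
    "state history": 0.30,
    "current events": 0.20,
}

# Hypothetical test: the objective each of 20 items is judged to measure.
item_objectives = (
    ["map reading"] * 4 + ["local government"] * 10 + ["state history"] * 6
)


def coverage_report(outline, item_objectives):
    """Return intended vs. actual share per objective, plus items off the outline."""
    counts = Counter(item_objectives)
    total = len(item_objectives)
    report = {}
    for objective, intended in outline.items():
        actual = counts[objective] / total if total else 0.0
        report[objective] = (intended, actual)
    unlisted = sorted(set(counts) - set(outline))
    return report, unlisted


report, unlisted = coverage_report(outline, item_objectives)
for objective, (intended, actual) in report.items():
    print(f"{objective:18s} intended {intended:.0%}  actual {actual:.0%}")
print("objectives tested but not in the outline:", unlisted)
```

In this made-up case the test gives no items at all to one objective and twice the intended weight to another, which is the kind of mismatch that lowers the logical validity of a test for a particular course.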


Reliability 


The precision or consistency of a measuring device is referred to as
its reliability. The precision is described by the standard error of
measurement. The consistency is described by the reliability
coefficient.

The standard error of measurement reflects the extent to which 
repeated measurements of the same thing tend to cluster together. 
If the means of measurement is precise, the repeated measurements
will tend to be similar to each other, and the standard error of
measurement will be small. If the means of measurement is not
precise, the repeated measurements will not be as similar, and the
standard error of measurement will be large.

If a boy actually weighs 95 pounds, and if he is weighed on an
accurate scale 100 times in rapid succession, each of the weighings
will be very close to his true weight. Sometimes the scale may read
96 pounds and sometimes 94 pounds, but seldom, if ever, will it
deviate from the true weight by more than one pound. If the same
boy has a true I.Q. of 95 and if he is tested 100 times in rapid
succession,¹ there would be considerably greater diversity of I.Q.
scores than of weights. The exact amount of scatter of the obtained
scores would be expressed in terms of the standard error of
measurement.

¹ This is a hypothetical example to illustrate the meaning of the
standard error of measurement. In practice, it is not possible to
retest the same boy 100 times.

Statisticians consider any measurement made to be only an esti- 
mate of the true measurement because they realize that some error 
is involved in all measurement procedures. Because of a relation- 
ship between the standard error of measurement and the “normal 
curve” it is possible to make probability statements about the true 
score of an individual based on the obtained score on a test and the 
size of the standard error of measurement. It is known that if the 
standard error of measurement is added to and subtracted from an
obtained measure, the chances are two out of three that the true
score will be contained within the interval so formed. Thus, if it is
known that the standard error of measurement on an intelligence
test is 6 points² and the score obtained by testing a pupil is 95, it is
possible to form the interval 89-101 (by subtracting 6 from 95 and
by adding 6 to 95) and to state that the chances are two out of three
that the “true I.Q.” of the student is somewhere between 89 and
101. There remains, of course, still one chance out of three that the
true I.Q. is somewhere outside of these limits, either greater than
101 or less than 89. It is also known that if twice the standard 
error of measurement is added to and subtracted from an obtained 
score, a range of scores is obtained which will contain the true score 
nineteen out of twenty times. Thus, in the example above, the 
chances are nineteen out of twenty that the true I.Q. of the student 
was between 83 and 107. 
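
The arithmetic of the example above can be written out as a short sketch. The function names are hypothetical. The coverage figures assume that errors of measurement follow the normal curve, in which case about 68 per cent of errors fall within one standard error and about 95 per cent within two; these are the "two out of three" and "nineteen out of twenty" chances quoted in the text.

```python
from math import erf, sqrt


def score_band(obtained, sem, k=1):
    """Interval formed by going k standard errors below and above a score."""
    return obtained - k * sem, obtained + k * sem


def normal_coverage(k):
    """Chance that a normally distributed error lies within k standard errors."""
    return erf(k / sqrt(2))


obtained_iq = 95  # score obtained by testing the pupil
sem = 6           # standard error of measurement reported for the test

for k in (1, 2):
    low, high = score_band(obtained_iq, sem, k)
    print(f"+/- {k} SEM: {low}-{high}, chance about {normal_coverage(k):.3f}")
# +/- 1 SEM: 89-101, chance about 0.683
# +/- 2 SEM: 83-107, chance about 0.954
```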

The second method of reporting the reliability of a test is the 
reliability coefficient. This coefficient is a correlation coefficient 
computed between scores obtained on two testings of the same
group.³

This coefficient reflects the degree to which the scores for the 
group tested tended to agree on the two occasions. If the scores 
on the two testings were in perfect agreement (an improbable
situation), the reliability coefficient would be equal to 1.00. A reliability
coefficient of .00 would indicate that there was no consistency of 
measurement, that there was no tendency for persons’ scores on the 
two testings to be close to each other. 


² Information about the size of a standard error of measurement is
usually furnished by the author of the test and will be found in the
manual that accompanies the test.

³ There are several different methods of computing the reliability of
a test, but a discussion of these methods is beyond the scope of this
book. See any of the standard texts for further discussion of this
topic.

Reliability coefficients are usually much higher than validity
coefficients. A reliability coefficient of .70 is low. A validity
coefficient of .70 is very high. The question of how high a correlation
coefficient must be is difficult to answer. A test of relatively low
reliability (.60 to .70) can be useful if it is used to make
comparisons between
groups, since errors in the scores of individuals tend to cancel each 
other out. Thus, if a teacher wanted to compare the reading ability 
of two different classes, a short test of low reliability could be used. 
If, however, a decision had to be made about the grade placement 
of an individual pupil in the class, a test of higher reliability would 
be needed. Tests used to make decisions about individual pupils 
should have high reliability. 
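
The statement that errors in individual scores tend to cancel out in group comparisons can be illustrated with a small simulation. This is only a sketch under stated assumptions: errors are drawn at random from a normal curve with a standard error of 6 points, a class has 30 pupils, and the function name and all figures are hypothetical.

```python
import random


def simulate(n_pupils=30, error_sd=6.0, trials=2000, seed=0):
    """Average size of the error in one pupil's score and in a class mean."""
    rng = random.Random(seed)
    total_individual = 0.0
    total_class_mean = 0.0
    for _ in range(trials):
        # One random error of measurement for each pupil in the class.
        errors = [rng.gauss(0.0, error_sd) for _ in range(n_pupils)]
        # Typical size of the error in a single pupil's score.
        total_individual += sum(abs(e) for e in errors) / n_pupils
        # Size of the error left in the class average after errors offset.
        total_class_mean += abs(sum(errors) / n_pupils)
    return total_individual / trials, total_class_mean / trials


individual_error, class_mean_error = simulate()
print(f"average error in an individual score: {individual_error:.2f} points")
print(f"average error in a class mean:        {class_mean_error:.2f} points")
```

Under these assumptions the error in the class mean is only a fraction of the error in a single pupil's score, which is why a test too unreliable for decisions about individuals may still serve for comparing classes.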


Relationship between validity and reliability 


Test-makers who have no real evidence about the validity of their 
tests sometimes emphasize the high reliability of the test. It is there- 
fore important to understand the relationship which exists between 
the validity and reliability of a measuring device. This relationship
is expressed as follows: a test may be reliable without being valid,
but a test cannot be valid without being reliable. Therefore,
reliability is a necessary prerequisite to validity, but does not in itself
guarantee validity. In fact, it is possible for a measure to be
extremely reliable and yet possess no validity whatsoever for a
particular purpose. For example, the circumference of the head
can be measured with great precision; however, this measure has 
been shown to have no significant correlation with intelligence, 
and thus is not a valid measure for the purpose of estimating 
intelligence. This does not mean that the circumference of the head 
is not a valid measure in general. After all, it is a perfectly valid 
measure for determining the size of hat a person should buy.
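
The head-circumference example can be imitated with made-up numbers. In the sketch below every figure is an assumption chosen for illustration: head sizes are drawn at random, each head is measured twice with a very small error, and the I.Q. scores are generated independently of head size, which builds in the lack of relationship described above. Nothing here is real data.

```python
import random
from math import sqrt


def correlation(x, y):
    """Product-moment correlation between two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / sqrt(sum_xx * sum_yy)


rng = random.Random(1)
n = 200

# Hypothetical "true" head circumferences in centimeters.
true_size = [rng.gauss(56.0, 1.5) for _ in range(n)]
# Two careful measurements of each head, each with a small error.
first_measure = [s + rng.gauss(0.0, 0.1) for s in true_size]
second_measure = [s + rng.gauss(0.0, 0.1) for s in true_size]
# Hypothetical I.Q. scores, generated independently of head size.
iq = [rng.gauss(100.0, 15.0) for _ in range(n)]

reliability = correlation(first_measure, second_measure)
validity_for_iq = correlation(first_measure, iq)
print(f"reliability coefficient:       {reliability:.2f}")
print(f"validity coefficient for I.Q.: {validity_for_iq:.2f}")
```

The two measurements agree almost perfectly, so the reliability coefficient is close to 1.00, yet the correlation with I.Q. stays near zero: high reliability by itself says nothing about validity for a particular purpose.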